**A Simheuristic Algorithm for Solving the Stochastic Omnichannel Vehicle Routing Problem with Pick-Up and Delivery**

#### **Leandro do C. Martins \*, Christopher Bayliss, Pedro J. Copado-Méndez, Javier Panadero and Angel A. Juan**

Internet Interdisciplinary Institute (IN3)–Computer Science Department, Universitat Oberta de Catalunya, 08018 Barcelona, Spain; cbayliss@uoc.edu (C.B.); pcopadom@uoc.edu (P.J.C.-M.); jpanaderom@uoc.edu (J.P.); ajuanp@uoc.edu (A.A.J.)

**\*** Correspondence: leandrocm@uoc.edu

Received: 24 August 2020; Accepted: 15 September 2020; Published: 19 September 2020

**Abstract:** Advances in information and communication technologies have made possible the emergence of new shopping channels. The so-called 'omnichannel' retailing mode allows customers to shop for products online and receive them at home. This paper focuses on the omnichannel delivery concept for the retailing industry, which addresses the replenishment of a set of retail stores and the direct shipment of the products to customers within an integrated vehicle routing formulation. Due to its *NP-hardness*, a constructive heuristic, which is extended into a biased-randomized heuristic and embedded into a multi-start procedure, is introduced for solving large-sized instances of the problem. Next, the problem is enriched by considering a more realistic scenario in which travel times are modeled as random variables. For dealing with the stochastic version of the problem, a simheuristic algorithm is proposed. A series of computational experiments illustrates how our simheuristic can provide reliable and low-cost solutions under uncertain conditions.

**Keywords:** omnichannel retail stores; vehicle routing problem; pick-up and delivery; biased-randomized heuristics; simheuristics

#### **1. Introduction**

Today, people are changing their shopping behavior. Recent advances in information and communication technologies have introduced new shopping channels and models, which make possible the expansion of e-commerce and, consequently, the emergence of new challenges in operational research, transportation, and logistics areas. Specifically, regarding modern e-commerce business models, new decision variables and constraints have been incorporated in them, leading to emerging variants of distribution problems in supply chain management.

Unlike brick-and-mortar stores, where salespeople are available to support customers in making their purchases, the popularization of mobile devices with Internet access has promoted the use of different shopping channels. The online channel, for example, has emerged as a competitive marketing channel to traditional retail centers, transforming e-commerce into a global trend and an important tool for every business worldwide [1]. With e-commerce, customers are immersed in an environment with a plethora of information, opinions, and access to a vast combined supply of stock, which together allow them to browse through different stores online. According to some experts, online shopping channels were predicted to kill off physical ones. However, the two co-exist and have completely transformed the way customers shop nowadays [2]. The use of a variety of shopping channels is referred to as 'omnichannel retailing': instead of having only the single option of physically visiting a store to buy products, consumers can also buy them online and have them delivered to their homes. Hence, the company also becomes responsible for delivering goods ordered online to consumers. This new retailing environment has therefore introduced the need to design integrated distribution systems both for serving online customers and for replenishing the stocks of retail stores [3].

This paper focuses on the emerging omnichannel vehicle routing problem (OM-VRP), an extension of the classical vehicle routing problem (VRP) in which a two-echelon distribution network is considered. The VRP aims to design cargo vehicle routes with minimum transportation costs, and sometimes the problem also considers facility location costs [4]. The generated routes are designed to distribute goods between depots and a set of final customers during a planning period [5]. Since the OM-VRP simultaneously tackles the replenishment of a set of retail stores and the direct shipment of products from those stores to customers in an omnichannel retailing environment, it can be seen as an integrated problem that combines the VRP and the pick-up and delivery problem (PDP). In the OM-VRP, first introduced by Abdulkader et al. [1], a single distribution center supplies a group of retail stores–the first echelon. In turn, this set of retail centers serves a set of online customers–the second echelon–using a single fleet of vehicles at both delivery levels in order to reduce transportation costs. Apart from reducing operating costs and improving supply chain competitiveness, the optimization of this two-echelon problem holds the potential to improve customer service levels and to enhance the on-time delivery of customer orders [6]. VRPs and PDPs are frequently treated in the literature as deterministic problems, where customers' demands and travel times are constant values. In real life, however, both demands and travel times are often exposed to some degree of uncertainty. In such scenarios, it is more accurate to model these quantities as random variables. In this paper, we extend the previous OM-VRP formulation by replacing the deterministic travel times with stochastic ones.

Since the VRP and the PDP are both *NP-hard* problems [7,8], the OM-VRP is, consequently, *NP-hard* as well. Therefore, we propose a multi-start procedure that employs a savings-based [9] biased-randomized heuristic [10]. The multi-start approach is able to generate a variety of good and feasible solutions in short computing times. Finally, in order to deal with the stochastic OM-VRP, the multi-start approach is combined with Monte Carlo simulation and extended into a simheuristic algorithm [11]. Simulation-optimization methods, and simheuristics in particular, allow us to properly deal with stochastic versions of combinatorial optimization problems [12,13], such as the one proposed in this work, where travel times are modeled as random variables. The simheuristic approach is also employed to measure the 'reliability' level of the proposed distribution plan, i.e., the probability that the plan can be deployed, without any route failure, in a realistic scenario under uncertain travel times. To the best of our knowledge, this is the first time that a stochastic version of the OM-VRP has been considered in the scientific literature.

The remainder of the paper is arranged as follows: Section 2 presents a brief literature review on related topics; Section 3 describes, in more detail, the addressed problem; Section 4 introduces the proposed solving methodologies; Section 5 presents an analysis of the results and a comparison between the proposed heuristic and another solution methodology; finally, Section 6 highlights the main conclusions of this work and proposes some future lines of research.

#### **2. Literature Review**

With recent advances in information and communication technologies, different shopping channels have emerged and attracted the attention of customers. Nowadays, customers use different channels to shop, which gives rise to a multichannel shopping environment [14]. The transition from multichannel to omnichannel retailing has been discussed by Verhoef et al. [15] and Hübner et al. [16]. In multichannel retailing, the online and offline channels are treated as separate businesses; in an omnichannel environment, by contrast, the use of both channels is completely integrated, providing consumers with a seamless experience [17]. The customer experience is therefore different, although multichannel and omnichannel retailing environments are often considered the same. Beck and Rygl [18] presented a complete categorization and definition of retailer channels, in which multi-, cross-, and omnichannel retailing are clearly defined. According to Hübner et al. [3], several operations within the omnichannel distribution system are responsible for its excellence, such as expanding delivery modes and increasing delivery speed and service levels.

As mentioned, the OM-VRP combines the VRP and a PDP. The classical VRP, originally proposed by Dantzig and Ramser [19], has been extensively studied by practitioners and academics due to its wide applications in several areas. The VRP belongs to a set of *NP-hard* combinatorial optimization problems (COPs) [7]. Therefore, the use of exact algorithms is efficient only for solving small-sized VRP instances. Mostly, these exact approaches are based on the combination of column and cut generation algorithms [20,21]. On the other hand, approximate algorithms, such as metaheuristics, are frequently very efficient for solving large-sized instances of COPs. Several metaheuristics have been proposed to solve the VRP, including tabu search (TS), genetic algorithms (GA), ant colony optimization (ACO), and some hybrid methodologies [22–25]. Pick-up and delivery problems have also been studied for more than 30 years. These problems incorporate route order dependencies in which some nodes should be visited before others in order to transfer inventory between them. Similar to the VRP, the PDP is also an *NP-hard* problem [8], and some exact methodologies, based on branch-and-cut and branch-and-cut-and-price algorithms, have been developed to optimally solve small-sized PDP instances [26,27]. Moreover, several heuristic and metaheuristic algorithms have been developed to solve the PDP and some of its variants. We highlight the use of TS [28], GA [29], large neighborhood search heuristics (LNS) [30], adaptive LNS (ALNS) [31,32], particle swarm optimization (PSO) [33], and greedy clustering methods (GCM) [34]. A literature review and classification of PDPs is presented by Berbeglia et al. [35].

To the best of our knowledge, Abdulkader et al. [1] were the first authors to address the omnichannel VRP as a combination of the VRP and the PDP in an omnichannel retailing context. For solving this novel integrated problem, the authors proposed a two-phase heuristic based on: (i) inserting consumers into retailers' routes and correcting infeasible solutions; and (ii) joining the routes through the maximum-savings criterion, i.e., the Clarke & Wright savings heuristic (CWS) [36]. Apart from this two-phase heuristic, a multi-ant colony algorithm (MAC) was proposed. A complete set of instances was generated to test their methodologies, and the MAC outperformed the heuristic. Martins et al. [9] proposed a simple and deterministic heuristic for solving the OM-VRP. Although no best-known solutions were found, the heuristic proved promising, since feasible solutions were provided in short computational times and no repair operations were needed. Recently, the same problem has also been solved by Bayliss et al. [37] and Martins et al. [38]. In contrast to Bayliss et al. [37], Martins et al. [38] framed the problem in the humanitarian logistics field, providing an agile optimization methodology–combining parallel computing with biased-randomized heuristics–for solving it in real time. Competitive results were found in milliseconds, enabling the agile optimization technique to outperform other heuristics from the literature. On the other hand, Bayliss et al. [37] formulated the OM-VRP as a mixed-integer program (MIP) and proposed a two-phase local search with a discrete-event heuristic for solving the problem. In contrast to the model of Abdulkader et al. [1], their formulation avoids the need to directly assign retailers to vehicles, since the pick-up and delivery constraints are modeled through routing variables.
The first phase of the approach employs a discrete-event constructive heuristic, while the second phase aims to refine the most promising solutions obtained in stage one, by applying a sequence of local search neighborhoods. The authors have found new best-known solutions for the vast majority of problem instances, and improved lower bounds for a set of small instances were obtained.

Constructive heuristics are simple deterministic procedures that follow a logical sequence of decisions and always generate the same solution when starting from the same point. Biased-randomized algorithms incorporate non-symmetric random sampling in order to diversify the behavior of a base constructive heuristic [39]. At each stage of the base constructive heuristic, a list of candidate decisions is considered. The list is sorted in decreasing order of benefit with respect to the objective function, and a candidate is then randomly selected. In this random selection process, higher-ranked candidates receive a higher probability of being selected. Biased-randomized algorithms have been employed for solving different COPs in transportation [40–42], scheduling [43,44], and facility location [39,45].

Simheuristic algorithms combine approximate methods, such as heuristics or metaheuristics, with simulation techniques in order to cope with stochastic combinatorial optimization problems, and they have proven to be an efficient approach for solving them [46]. The extension of traditional metaheuristics into simheuristics is gaining popularity, and simheuristics have already been applied to a wide set of stochastic optimization problems in fields such as permutation flow-shop scheduling [47], facility location [48], inventory routing [48–50], telecommunication networks [51], and finance [52].

Other examples of simheuristics include Lam et al. [53], Lopes et al. [54], and Santos et al. [55]. Lam et al. [53] employ a simheuristic approach for evolving agent behavior in the exploration of novel combat tactics; they use a genetic algorithm to find the states, during a flight maneuver, at which an aircraft should transition into the next phase of the maneuver. Lopes et al. [54] tackle a stochastic assembly line problem using a simheuristic approach, in which demands for different product types are stochastic and occur in real time; the optimization task lies in assigning tasks to stations in such a way that task precedence constraints are respected and the expected throughput is maximized. Santos et al. [55] develop a simheuristic-based decision support tool for an iron ore crusher circuit: they propose and validate a simulation model of throughput efficiency as a function of the amount of equipment active in each phase of the circuit, and integrate this model with an iterated local search (ILS) metaheuristic. Their solutions improve throughput and reduce energy consumption compared with the solution in which all equipment is active.

Despite the fact that travel times are stochastic in most real-life transportation activities, in the past only deterministic versions of the OM-VRP have been considered. Hence, our work goes one step further by considering random travel times and proposing a simheuristic algorithm to solve the stochastic version of the OM-VRP.

#### **3. Details on the Stochastic OM-VRP**

Retail stores must usually be supplied from the depot with large quantities of products, which are packed and frequently measured in terms of the number of pallets. In contrast, online customer orders are of negligible size but require processing at a physical retail store before their delivery. Since orders must be processed at retail stores, products ordered online cannot be shipped directly from the central warehouse to online customers.

In the OM-VRP, a single fleet of vehicles is employed to simultaneously perform three different operations: (i) bulk deliveries to retail stores from a main central depot; (ii) the pick-up of online customer orders from these retail stores; and, finally, (iii) the delivery of online customers' orders. Different product types are available at each retail store. Likewise, each retail store has a limited inventory of processed products that can be delivered to online customers. Therefore, the vehicle responsible for a particular online order must first visit a retail store with that product in stock. The available processed inventory and bulk demand for each retail store are known in advance. It is assumed that bulk inventory does not contribute to a retailer's available inventory, since bulk deliveries cannot be processed within the drop time of a delivery. As a simplifying assumption, each online customer orders a single product. When more than one item is ordered, the consumer is replicated in the same location as a 'virtual' consumer, and each product is delivered separately. According to Abdulkader et al. [1], this strategy of considering single-item orders guarantees solution feasibility and also minimizes the distribution cost.

Figure 1 depicts how the processes of visiting retail centers and online customers are performed by the same fleet of vehicles. A route starts from the single, central depot, and the same cargo vehicle visits a set of retail centers and customers. Deliveries to customers cannot be performed directly from the depot. Hence, drivers must meet precedence constraints: customer orders must be picked up from the retail centers before they can be delivered. Since retailers are supplied by the depot, these nodes are both delivery and pick-up points. Customers, on the other hand, require only the delivery of items from these retail centers, making them delivery nodes. Apart from precedence and inventory constraints, each route is limited by a maximum tour length and a vehicle capacity limit.

**Figure 1.** A general schema of the OM-VRP.

We can define the distribution network of the OM-VRP as a directed graph *G* = (*V*, *A*). The set of vertices *V* = {0, 1, 2, ... ,*r* + *c*} is composed of a single depot (0), *r* retail centers, and *c* online customers. In other words, the problem is represented by a set *R* of retail stores, which are supplied by the depot 0, and a set *C* of online customers, which are subsequently supplied by the retail centers. The set *N* = *R* ∪ *C* comprises both sets *R* and *C*, in which *R* = {1, 2, ... ,*r*} and *C* = {*r* + 1,*r* + 2, ... ,*r* + *c*}. A single fleet of capacitated vehicles, initially stationed at the central depot at the start time *t*0, is employed for serving both the retail centers and the consumers. For the complete mathematical formulation of this problem, readers are referred to Abdulkader et al. [1]. In this work, we extend previous work on this topic by considering the case in which edge-traversal times are stochastic. We define the time to traverse edge (*i*, *j*) as *Tij* = *tij* + *Dij*, where *tij* is the deterministic edge-traversal time (which represents the edge-traversal time under 'ideal' traffic conditions), *Dij* is a log-normally distributed delay term, and *Tij* denotes the resulting stochastic edge-traversal time. The objective of this extended problem is to minimize the total travel cost of the vehicle routes such that: (i) every route starts and ends at the depot; (ii) the routes do not violate the maximum tour length; (iii) every node *i* ∈ *N* is visited by only one vehicle and only once; (iv) the total load of the vehicle does not exceed the vehicle capacity; (v) the total consumer demand to be fulfilled by a vehicle for a specific product does not exceed the inventory picked up by the vehicle; (vi) the retail store determined to satisfy the consumer's demand must be visited before the consumer and by the same vehicle; and (vii) the routes must have a reliability level greater than a user-set parameter, *Rmin*.
The reliability level of a set of routes is defined as the probability that all routes are completed within the maximum time limit of *Tmax*.

Two alternative mathematical formulations of the deterministic OM-VRP can be found in Abdulkader et al. [1] and Bayliss et al. [37]. The differences in our case are a stochastic objective function and a probabilistic constraint regarding route completion reliability. Let *xfkij* be a binary decision variable indicating whether or not vehicle *f* in the fleet *F* traverses edge (*i*, *j*) as the *kth* node visit in its route. The stochastic objective function is given in Equation (1), where *Dij* is the stochastic delay associated with traversing edge (*i*, *j*).

$$\min \ E \left( \sum\_{f \in F} \sum\_{k \in K} \sum\_{i \in V} \sum\_{j \in V} x\_{fkij} \left( t\_{ij} + D\_{ij} \right) \right). \tag{1}$$

The probabilistic route completion reliability constraint is given in Equation (2).

$$R\_{\min} \le P\left(\sum\_{i \in V} \sum\_{j \in N} \sum\_{k \in K} \left(t\_{ij} + \tau + D\_{ij}\right) x\_{fkij} + \sum\_{i \in V} \sum\_{k \in K} \left(t\_{i0} + D\_{i0}\right) x\_{fki0} \le T\_{\max} \quad \forall f \in F\right). \tag{2}$$

Equation (2) also accounts for the drop times (*τ*) that are required when performing deliveries and pickups. In this work we employ a simheuristic approach for addressing the stochastic aspects of the OM-VRP. The main novelty of the OM-VRP, in comparison to other VRPs, is the precedence constraints regarding the collection of customer orders from retailers and subsequent delivery to customers. The vehicle capacity constraint regarding the retailer demands that can be satisfied are expressed by Equation (3), where *OPj* denotes the number of ordered product units of node *j* (which is zero for customer nodes) and *H* denotes the vehicle capacity.
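Because every *OPj* is non-negative and the whole bulk load is taken on at the depot, checking Equation (3) for a single route reduces to verifying that the route's total bulk demand does not exceed *H*. The following minimal sketch illustrates this; the function name, data layout, and the numbers in the usage note are hypothetical, not taken from the paper's instances:

```python
def capacity_feasible(route, OP, H):
    """Check the vehicle capacity constraint of Equation (3) for one route.

    `route` is the sequence of visited nodes (depot excluded), `OP` maps a
    retailer node to its bulk demand OP_j (customer nodes are absent, i.e.,
    OP_j = 0), and `H` is the vehicle capacity. Since OP_j >= 0 and the bulk
    load is taken on at the depot, the cumulative constraint for every k is
    implied by the total over the route.
    """
    return sum(OP.get(node, 0) for node in route) <= H
```

For instance, with hypothetical bulk demands `{1: 40, 3: 53}` a route visiting retailers 1 and 3 loads 93 units at the depot, which is feasible for `H = 100` but not for `H = 90`.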

$$\sum\_{i \in N} \sum\_{j \in V} \sum\_{l=1}^{k} x\_{flij} OP\_j \le H, \forall f \in F, \ \forall k \in K. \tag{3}$$

The customer order precedence constraints are expressed by Equation (4), where *ODjq* denotes the number of items of type *q* ∈ *Q* ordered by node *j* (which is zero for retailer nodes) and *Pjq* denotes the number of items picked up of type *q* ∈ *Q* at node *j* (which is zero for customer nodes).

$$\sum\_{i \in V} \sum\_{j \in V} \sum\_{l=1}^{k} x\_{flij} \left( P\_{jq} - OD\_{jq} \right) \ge 0, \forall f \in F, \ \forall k \in K, \ \forall q \in Q. \tag{4}$$

Table 1 provides the details for a small-sized instance, which is composed of one central warehouse, four retail stores, and nine online customers. For this instance, Table 1 provides: the geographic coordinates (*X* and *Y*) of each node, the demand required from the depot by each retail center *i* ∈ *R*, the demands of each customer *j* for each product type *q* (*ODjq*), and the inventory of each available product *Piq*, *q* ∈ {1, 2, 3}, at the retail centers. Since the products are negligible in terms of capacity, they do not affect the vehicle load capacity. Accordingly, Figure 2 presents the optimal solution for the provided example. The routes are performed with a total cost of 455.91 distance units, using a fleet of vehicles, each with a capacity of 100 weight units. In the first route, customers 8, 5, and 9 are served by retail center 1, while customers 12 and 13 are served by retail center 3. The route requires 272.37 distance units, and the vehicle starts with 93 loaded demand units. Note that the retail centers are visited first in order to deliver the demand required from the depot and to pick up the products ordered by the customers. Similarly, in the second route, customers 11 and 10 are supplied by retail center 4, while customers 6 and 7 are served by retail store 2. The route requires 183.53 distance units, starting with 79 loaded demand units.


**Table 1.** Problem instance data.

**Figure 2.** The optimal solution routes for Table 1.

#### **4. Methodology**

The multi-start framework belongs to a family of metaheuristic algorithms [56] which includes approaches such as evolutionary algorithms [57,58] and swarm intelligence algorithms [59–62], among others. Metaheuristics were introduced in order to provide high-quality solutions within a reasonable amount of computation time and memory. Typically, they require a time-consuming parameter tuning process. Although there are approaches that provide self-adaptive parameter control [58], our methodology has been devised to work with a reduced number of parameters, providing an excellent trade-off between simplicity, computational time, and solution quality.

In order to solve the stochastic OM-VRP, a simheuristic approach is proposed. Our methodology combines a biased-randomized savings-based heuristic with simulation techniques to deal with the stochastic aspects of the problem. Biased-randomized heuristics can be extended into simheuristics in a natural way. Starting from a savings-based heuristic, which is completely deterministic, this solution method is then extended into a biased-randomized multi-start algorithm. Also, a local search procedure is applied to search for locally optimal solutions. Finally, the multi-start framework is extended into a simheuristic approach in order to deal with stochastic travel times.

#### *4.1. Savings-Based Heuristic*

As mentioned above, the savings-based heuristic (LH) [9] is based on greedy decisions and is designed to solve the deterministic OM-VRP. The LH is composed of the following three stages:


Initially, a savings list (*SL*) is constructed (step 1). This list considers all possible pairs of nodes–i.e., edges–from the problem. For each edge {*i*, *j*}, the corresponding savings value is calculated as *sij* = *t0i* + *tj0* − *tij*, where *tij* represents the deterministic travel time between nodes *i* and *j*. That is, candidate solutions are generated using 'ideal' traffic conditions for the edge-traversal times.
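The construction of the savings list can be sketched as follows. This is a minimal illustration with hypothetical names, in which Euclidean distances stand in for the deterministic travel times:

```python
import math

def travel_time(a, b):
    """Euclidean travel time between two coordinate pairs (illustrative)."""
    return math.dist(a, b)

def build_savings_list(coords):
    """Build the savings list SL for nodes 1..n, with node 0 as the depot.

    For each edge {i, j}, s_ij = t_0i + t_j0 - t_ij, i.e., the travel time
    saved by serving i and j in one route instead of two depot round trips.
    """
    n = len(coords) - 1  # number of non-depot nodes
    sl = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = (travel_time(coords[0], coords[i])
                 + travel_time(coords[j], coords[0])
                 - travel_time(coords[i], coords[j]))
            sl.append((s, i, j))
    sl.sort(reverse=True)  # descending savings value (steps 1-2)
    return sl
```

With coordinates `[(0, 0), (0, 10), (1, 10), (10, 0)]`, the edge {1, 2} tops the list, since those two nodes are far from the depot but close to each other.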

Initially, all edges in *SL* are eligible. The list is sorted in descending order of the savings value (step 2), and the edge with the highest savings is selected (step 3). At this stage, the selection of edges is restricted in order to guarantee the assignment of a retail center to each consumer in the problem (step 5). In other words, a route containing one single customer can only be merged with a route containing a retail center, i.e., the selection is restricted to eligible edges {*i*, *j*} where node *j* is a consumer in a dummy route and *i* is a node in a route with a retailer that can supply consumer *j*. According to Martins et al. [9], these merges of single-consumer routes with routes containing at least one retail center that can supply the consumer are attempted first in order to avoid infeasible solutions–i.e., solutions in which some customers are not assigned to any retailer. This approach of addressing solution feasibility first is based on the observation that the availability of feasibility-restoring merges only decreases as the algorithm progresses.

Based on CWS route-merging conditions (step 6), the two corresponding routes, *i* and *j*, of an edge {*i*, *j*} (obtained in steps 4.1 and 4.2) can be merged only if: (i) nodes *i* and *j* are exterior in their respective routes (a node is exterior if it is adjacent to the depot); (ii) *i* and *j* belong to different routes; (iii) the maximum tour length is not violated; and (iv) the vehicle capacity is not violated.
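The four merge conditions can be expressed as a simple feasibility test, sketched below with hypothetical names. The merged tour length is `len_i + len_j - s_ij`, which follows directly from the definition of the savings value:

```python
def is_exterior(node, route):
    """A node is exterior if it is adjacent to the depot (a route endpoint)."""
    return route[0] == node or route[-1] == node

def can_merge(i, j, route_i, route_j, len_i, len_j, load_i, load_j,
              s_ij, max_len, capacity):
    """Check the CWS merge conditions (i)-(iv) for edge {i, j}.

    Routes are node sequences without the depot; len_* and load_* are the
    current tour lengths and vehicle loads of the two routes.
    """
    if route_i is route_j:                      # (ii) must be different routes
        return False
    if not (is_exterior(i, route_i) and is_exterior(j, route_j)):
        return False                            # (i) both nodes exterior
    if len_i + len_j - s_ij > max_len:          # (iii) maximum tour length
        return False
    if load_i + load_j > capacity:              # (iv) vehicle capacity
        return False
    return True
```

In the full heuristic this test would be complemented by the inventory eligibility check of step 5, which is omitted here for brevity.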

The selected edge is deleted from *SL* (step 9) only if: (a) the corresponding merge is performed (step 7); or (b) at least one of the CWS constraints ((i)–(iv)) are violated (in step 6). Otherwise, the edge becomes temporarily ineligible (step 10), but it is not removed from the list since subsequent merges might restore eligibility. This can occur when a different retail center is merged into a route, increasing the available processed inventory for subsequent consumers. For example, when selecting an edge (*i*, *j*), the evolving route of *i* may have insufficient processed inventory for customer *j* at this time. However, if in subsequent iterations route *i* is merged with another route containing retailers, the evolving route of *i* may then be able to serve customer *j*.

When a merge is successfully performed (step 7), the entire *SL* becomes eligible (step 8), since a new inventory scenario is generated.

At the end of this stage, all the consumers are supplied by the retail centers, guaranteeing a feasible final solution. Notice that this is achieved without solving separate assignment and routing problems, as done in Abdulkader et al. [1].

3. Finally, the third stage (CWS2) tries to improve the solution generated in the previous step. To do so, the algorithm cycles through the *SL* list, which includes the remaining savings edges that were discarded in the previous step, with the aim of identifying more beneficial merges. Unlike the procedure used in the CWS1 stage, in this phase all the customers are already assigned to a retail center, so step 5 of Figure 3 is not required; hence, all edges are eligible. The process attempts all the available merging possibilities that may improve the solution. Each time a new edge is selected from the *SL* list, it is removed from the list, whether or not the corresponding merge is performed (a merge may fail because it violates one of the constraints (i)–(iv)). In each new iteration, the highest-savings edge is selected to restart the merge process. This process is repeated until *SL* is empty. At the end of the procedure, a feasible solution is obtained without the need for repair operations.

**Figure 3.** The flowchart of CWS1.

#### *4.2. Introducing a Local Search*

Once an initial and feasible solution has been constructed, a fast local search is applied to improve its quality. This local search mechanism is based on the *2-opt* movement. Since this problem has precedence particularities, the *2-opt* movement, which consists of reversing a sub-sequence of nodes, is restricted to movements that do not violate the precedence order between customers and their suppliers–hence, the feasibility of the solution is preserved.
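A precedence-respecting *2-opt* pass might be sketched as follows. All names are illustrative; `supplier` maps each customer to its assigned retail store, and a reversal is accepted only if it shortens the route while keeping every customer after its supplier:

```python
def route_cost(route, dist):
    """Total cost of a depot-to-depot tour over the nodes in `route`."""
    tour = [0] + list(route) + [0]
    return sum(dist[(tour[k], tour[k + 1])] for k in range(len(tour) - 1))

def precedence_ok(route, supplier):
    """Check that every customer appears after its assigned retail store."""
    seen = set()
    for node in route:
        if node in supplier and supplier[node] not in seen:
            return False
        seen.add(node)
    return True

def two_opt_pass(route, dist, supplier):
    """One 2-opt pass: reverse a sub-sequence whenever doing so shortens the
    route and preserves customer/retailer precedence."""
    best = list(route)
    for a in range(len(best) - 1):
        for b in range(a + 1, len(best)):
            cand = best[:a] + best[a:b + 1][::-1] + best[b + 1:]
            if (precedence_ok(cand, supplier)
                    and route_cost(cand, dist) < route_cost(best, dist)):
                best = cand
    return best
```

Reversals that would place a customer before its supplying retailer are simply skipped, so every candidate route remains feasible.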

#### *4.3. Extending to a Biased-Randomized Algorithm*

To modify the original greedy and deterministic behavior of our heuristic, the selection of candidates from the *SL* is randomized by introducing a skewed probability distribution into the selection process. For the biased-randomized component, we employ a geometric distribution, which is controlled by a single parameter, *β* ∈ (0, 1). *β* values closer to 1 increase the probability that the highest-savings merges are selected. Conversely, as *β* approaches 0, the selection becomes near-uniformly random. In this way, *β* controls the rate at which the selection probabilities decrease along the sorted *SL*. Hence, unlike deterministic heuristics, which always generate the same solution when starting from the same initial solution, different decisions are taken at each selection iteration, consequently generating different solutions. Algorithm 1 represents the selection process of LH with biased randomization, which replaces the greedy selection of edges from the *SL*.
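The geometric selection step can be sketched as follows. Position *k* in the sorted list is chosen with probability roughly proportional to *β*(1 − *β*)^*k*; the wrap-around used to keep the sampled index valid is one common implementation choice and is our assumption here:

```python
import math
import random

def biased_random_index(n, beta, rng=random):
    """Sample a position in a sorted candidate list of size n using a
    geometric distribution with parameter beta in (0, 1): position k is
    drawn with probability ~ beta * (1 - beta)**k, so top-ranked candidates
    are favored. The index is wrapped into range via the modulo operation."""
    k = int(math.log(1 - rng.random()) / math.log(1 - beta))
    return k % n
```

A typical use is `s, i, j = sl[biased_random_index(len(sl), beta)]` on the sorted savings list: with *β* close to 1 the top edge is almost always chosen (recovering the greedy heuristic), while smaller *β* values diversify the constructed solutions.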


Considering all of the stages which have been introduced, Algorithm 2 represents the overall structure of our biased-randomized algorithm with local search (BRLH). This algorithm is then embedded into a multi-start procedure in order to obtain a variety of solutions (MSBRLH).


#### *4.4. Extending to a Simheuristic Approach*

Deterministic travel times are widely assumed in transportation problems. However, when dealing with real-life problems, which are often fraught with uncertainty, travel times are usually stochastic in nature. As introduced in Section 3, the deterministic travel time employed to traverse edge (*i*, *j*), *tij*, can be seen as the travel time required under ideal traffic conditions. In the proposed extension, the stochastic time to traverse (*i*, *j*) is computed as *Tij* = *tij* + *Dij*, where *Dij* follows a *logNormal*(*μ*, *σ*) probability distribution, and represents a random delay resulting from uncertain conditions. Since the *logNormal* probability distribution can only take positive values, it follows that *Tij* > *tij*, ∀(*i*, *j*) in the set of connecting edges.
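Sampling a stochastic edge-traversal time is then straightforward. The sketch below is illustrative: the paper does not specify how *μ* and *σ* are chosen, so the mean-matching helper, which sets *μ* so that the expected delay equals a target value, is our assumption:

```python
import math
import random

def lognormal_mu(expected_delay, sigma):
    """Choose mu so that E[D_ij] equals a target expected delay, using
    E[logNormal(mu, sigma)] = exp(mu + sigma**2 / 2)."""
    return math.log(expected_delay) - sigma ** 2 / 2

def sample_travel_time(t_ij, mu, sigma, rng=random):
    """Sample T_ij = t_ij + D_ij, where the delay D_ij follows a
    logNormal(mu, sigma) distribution; hence T_ij > t_ij always holds."""
    return t_ij + rng.lognormvariate(mu, sigma)
```

Because the log-normal delay is strictly positive, every sampled time exceeds the 'ideal-conditions' time *tij*, matching the definition in Section 3.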

Algorithm 3 describes our simheuristic approach (SimBRLH). Initially, a solution is generated by our BRLH in line 1 (Algorithm 2), employing the near-greedy setting (i.e., *β* ≈ 1). A short simulation is then performed on this initial solution (line 2) in order to estimate its average stochastic cost, which is set as the best-found stochastic solution cost (line 4). While the termination criterion is not met (line 6), different solutions are generated by BRLH (line 7). Deterministic costs are used to guide the search: a solution is submitted to the simulation module (line 10) only if its deterministic cost is smaller than the best-found deterministic solution cost plus *m*% of its value (line 9). This solution-filtering approach reduces the amount of time spent on testing unpromising solutions in computationally expensive simulations. Moreover, by allowing the acceptance of moderately worse solutions, controlled by the parameter *m*, a better exploration of the solution space can be achieved [63]. At this stage, *qshort* Monte Carlo simulation runs are used to test the accepted solution. Each simulation run replaces the deterministic travel times of a solution with randomly sampled ones, according to the assumed probability distribution. From this simulation process, the average stochastic cost of each solution is computed. Every time a new best stochastic cost is found (line 15), the corresponding solution is introduced into a pool of 'elite' solutions *E* (line 17). This process is repeated until the termination criterion is met. On this reduced set of solutions, *qlong* Monte Carlo simulation runs are performed (line 23) in order to obtain more accurate estimates for the stochastic environment. During the simulation process, we also obtain an estimate of the reliability of a solution [64], computed as the rate at which all routes show completion times lower than the maximum allowed travel time. At the end, the set of elite solutions is sorted in descending order of their expected cost (line 25), and the best-found stochastic solution is provided to the manager.
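The overall flow just described can be sketched in a few lines of Python. This is a simplified illustration under our own naming and interface assumptions (`generate`, `sample_time`, and the cost representation are ours), not the authors' implementation:

```python
import random

def simulate_cost(solution, runs, sample_time):
    """Monte Carlo estimate of a solution's average stochastic cost.

    `solution` is a list of routes; each route is a list of edges, and
    `sample_time(edge)` returns one random travel-time sample for it.
    """
    total = 0.0
    for _ in range(runs):
        total += sum(sample_time(e) for route in solution for e in route)
    return total / runs

def simheuristic(generate, sample_time, iters, margin=0.1,
                 q_short=50, q_long=500, elite_size=5):
    """Sketch of the SimBRLH flow: deterministic-cost filtering, short
    simulations for screening, and long simulations on an elite pool.

    `generate()` must return a (solution, deterministic_cost) pair.
    """
    best_sol, best_det = generate()
    best_stoch = simulate_cost(best_sol, q_short, sample_time)
    elite = [(best_stoch, best_sol)]

    for _ in range(iters):
        sol, det = generate()
        # Filter: skip solutions whose deterministic cost exceeds the
        # best-found deterministic cost by more than the margin (= m%).
        if det > best_det * (1.0 + margin):
            continue
        best_det = min(best_det, det)
        stoch = simulate_cost(sol, q_short, sample_time)
        if stoch < best_stoch:  # new best average stochastic cost
            best_stoch = stoch
            elite.append((stoch, sol))
            elite = sorted(elite, key=lambda p: p[0])[:elite_size]

    # Long simulations give more accurate estimates for the elite pool.
    refined = [(simulate_cost(s, q_long, sample_time), s) for _, s in elite]
    return min(refined, key=lambda p: p[0])
```

The key design choice is that expensive simulation is spent only on solutions that already look promising deterministically, while the margin `m` still admits moderately worse candidates for diversification.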



#### **5. Results and Discussion**

To test the proposed methodologies, we have used the 60 problem instances introduced in Abdulkader et al. [1]. These instances differ in the number of retail centers (*r* ∈ {10, 15, 20, 25}), in the number of online customers (*c* ∈ {20, 50, 75, 100, 150}), and in the inventory scenarios of the retail centers (tight, relaxed, and abundant). Therefore, for each inventory scenario, 20 different problem combinations are generated. The maximum tour length of the routes and the vehicle capacity are set to 8 h and 100 weight units, respectively. While the BRLH is guided by a single parameter, *β*, the simheuristic approach is also controlled by the maximum running time *timemax*, the acceptance margin for worse solutions *m*, and the numbers of simulation repetitions, *qshort* and *qlong*, in the short and long simulation runs, respectively. The stochastic travel times for each edge are determined by the log-normal distribution parameters, *μ* and *σ*.

Table 2 summarizes the parameter setup employed during the computational experiments. For calibrating these parameters, we have used the methodology proposed in Calvet et al. [65], which is based on a general and automated statistical learning procedure. Regarding the maximum run time, its value depends on the size of the instance, *timemax* = (*c* + *r*) × 0.342 s, which leads to a maximum execution time of 60 s for the largest instance. This stopping criterion is employed in both the multi-start and simheuristic strategies.
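The instance-dependent stopping criterion can be expressed as a one-line helper (the function name is ours):

```python
def time_budget_seconds(num_customers, num_retail_centers, factor=0.342):
    """Instance-size-dependent stopping criterion: time_max = (c + r) * 0.342,
    giving roughly 60 s for the largest instance (150 customers, 25 centers)."""
    return (num_customers + num_retail_centers) * factor
```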

The algorithms were coded in Java and all tests were performed on an Intel Core i7-8550U processor with 16 GB of RAM.

**Table 2.** Parameter setup.


Note that three different values have been considered for the *σ* parameter, which is used to modify the deterministic travel times. Since the maximum tour length is fixed independently of the size of the instances, small-sized instances are more likely to generate short routes. Therefore, a larger value of *σ* is used to introduce more variability into their travel times, which increases route-failure rates. On the other hand, large-sized instances are often composed of longer routes, so a smaller value of *σ* is used. In particular, we set *σ* = 2.5 for small instances (25 customers), *σ* = 1.55 for large instances (150 customers), and *σ* = 1.9 for the remaining medium-sized ones. This approach ensures that small, medium, and large instances have similar levels of difficulty with respect to the risk of route failure.

The initial analysis compares the solutions generated by our greedy heuristic (LH) and by our MSBRLH (in which *β* is uniformly sampled from the interval [0.45, 0.75]) with the solutions obtained by the two-phase heuristic (AH) and the multi-ant colony metaheuristic (MAC) proposed by Abdulkader et al. [1]. Their methodologies were run on four 2.1 GHz processors with 16 cores each and a total of 256 GB of RAM. That is, we initially focus on comparing each algorithm in terms of deterministic travel cost. Tables 3–5 present the results obtained for the tight, relaxed, and abundant inventory scenarios, respectively. For each problem instance (I), we report: the cost of the best-found solution obtained by each methodology; the average cost of our MSBRLH; the CPU time (in seconds) required by each methodology; and their percentage gaps. The best results returned by the solution methodologies are highlighted in bold. Figures 4 and 5 show how the gap and the cost of the solutions, i.e., the objective function (OF) value, behave according to the employed solution approach and the inventory scenario, respectively.


**Table 3.** Comparison of the results obtained by our methodologies (LH and MSBRLH) with those obtained by Abdulkader et al. [1]'s methods (AH and MAC) in the tight inventory scenario.

**Table 4.** Comparison of the results obtained by our methodologies (LH and MSBRLH) with those obtained by Abdulkader et al. [1]'s methods (AH and MAC) in the relaxed inventory scenario.



**Table 5.** Comparison of the results obtained by our methodologies (LH and MSBRLH) with those obtained by Abdulkader et al. [1]'s methods (AH and MAC) in the abundant inventory scenario.

**Figure 4.** Gap between our best-found solutions (from MSBRLH) and the MAC's results, for each inventory scenario.

**Figure 5.** Comparison of the solution costs (OF value) from each solving approach, for each inventory scenario.

By analyzing Tables 3–5, we can observe that our MSBRLH algorithm improves the LH results by 9–12% on average (column *gap (2)–(1)*). When comparing MSBRLH with the AH heuristic (column *gap (3)–(2)*), our approach reduces solution costs by up to 26% in short computational times (about 18 s on average). On the other hand, when comparing our results with those generated by the MAC approach (column *gap (4)–(2)*), our solutions are between 9% and 21% worse on average. In particular, in the abundant inventory scenario, MSBRLH's results are only 9% worse than MAC's. Notice, however, that the processing time required by MAC is substantially larger in all inventory scenarios. Figure 4 shows that MSBRLH performs best in the abundant inventory scenario, finding one better solution and several others with a maximum gap of 8%. Moreover, the gap between MSBRLH and MAC varies by around 10% on average, a variability that is reduced to around 8% in the relaxed and abundant scenarios. These results demonstrate the robustness of our solution approach for the deterministic case. Figure 5, which presents the overall performance of each solution approach for each inventory scenario, shows that our multi-start strategy is more efficient than both the LH and AH heuristics, generating solutions with a lower cost. To complement these box plots, an ANOVA test was run for each inventory scenario. The *p*-values associated with the tight, relaxed, and abundant inventory scenarios were 0.001, 0.000, and 0.000, respectively. In addition, Fisher's LSD test indicates significant differences, in all three scenarios, between MAC and AH, between MSBRLH and AH, and between MAC and LH. However, the cost differences between MAC and MSBRLH were not significant in any scenario, despite MAC employing a noticeably larger amount of computing time than MSBRLH.

Next, in Figure 6, we present the convergence of three problem instances' solutions, including one from each different inventory scenario (instances *b*6, *b*26, and *b*46), by comparing the MSBRLH with the best-known solutions, in terms of gap. As introduced, these instances require approximately 15 s of processing time, given their magnitude.

**Figure 6.** Comparison between MSBRLH and MAC on solving large-sized instances.

As we can see in Figure 6, the instances exhibit the same convergence behavior as solution time increases: convergence is rapid during the first few seconds. However, in contrast to the tight and relaxed inventory scenarios, where the solutions continue improving over time, the search stabilizes during the remaining execution time in the abundant scenario. Being more efficient in the more flexible inventory case, our multi-start approach quickly reaches its best solutions.

Next, we compare the quality of the solutions generated for the deterministic scenario (MSBRLH) against the solutions generated for the stochastic scenario (SimBRLH). Since the only difference between the deterministic and stochastic scenarios is that stochastic delays are added to edge-traversal times, we can consider the deterministic cost of the best solution generated by MSBRLH as a lower bound (LB) on the stochastic travel time of the best SimBRLH solution. Moreover, since MSBRLH does not account for stochastic travel times, we can consider the stochastic travel time of the best MSBRLH solution as an upper bound (UB) on the stochastic travel time of the best SimBRLH solution. Table 6 provides both the LB and UB values, together with the best-found stochastic travel times obtained by our SimBRLH (reported in the SimBRLH column).


**Table 6.** Analysis of the results obtained by our SimBRLH on scenarios of tight, relaxed and abundant inventory.

As we can see in Table 6, all the SimBRLH solution costs lie between the LB and the UB, as expected. For 24 problem instances, the solution returned by our simheuristic is better than the best deterministic solution when the latter is tested in the stochastic scenario (the *UB* column). From this, we can assert that our SimBRLH generates competitive results for the stochastic scenario. The reliability value is estimated by simulation for each solution and represents the probability that all routes are completed within the maximum tour duration. For visualizing the trade-off between the deterministic cost of the solutions and their reliability, which are conflicting objectives, a Pareto frontier of non-dominated solutions is presented. Accordingly, Figure 7a–c presents the non-dominated solutions for three instances (*b*17, *b*37, and *b*57), each belonging to a different inventory scenario. A solution is non-dominated if no other solution has a greater reliability and a lower or equal travel cost, and no other solution has a lower travel cost and a greater or equal reliability. The *b*17 instance was randomly chosen, while *b*37 and *b*57 correspond to the same problem set in the two other inventory scenarios. The square (orange) dot represents the best deterministic solution found by our MSBRLH, while the remaining round (blue) dots represent solutions with a higher reliability but higher operating costs.
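A straightforward way to extract such a set of non-dominated (cost, reliability) pairs is sketched below; the function name and data layout are our own assumptions, not the paper's code:

```python
def non_dominated(solutions):
    """Filter (cost, reliability) pairs, keeping only solutions for which
    no other solution is at least as good in both objectives (lower or
    equal cost AND greater or equal reliability) and strictly better in one.
    """
    front = []
    for c, r in solutions:
        dominated = any(
            c2 <= c and r2 >= r and (c2 < c or r2 > r)
            for c2, r2 in solutions
        )
        if not dominated:
            front.append((c, r))
    return front
```

Applied to the simulated solutions of one instance, this yields exactly the kind of cost-versus-reliability frontier plotted in Figure 7.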

As Figure 7 shows, despite having the lowest cost, the best deterministic solutions (square dots) are the least reliable ones in stochastic scenarios. In Figure 7a, in particular, the best deterministic solution is only about 21% reliable under stochasticity, whereas selecting higher-cost solutions raises the reliability above 70%. The same behavior is noticeable in Figure 7b,c, although the deterministic *b*57 solution is reasonably reliable. Usually, low-cost solutions are made up of a small number of long vehicle routes, in terms of travel distance and time; hence, when travel-time variability increases, the risk of exceeding the time constraint is higher. On the other hand, higher-cost solutions are built from a larger number of shorter routes, which exhibit a lower risk of violating the maximum tour-duration constraint in stochastic scenarios. Decision-makers should therefore be aware that low-cost solutions for the deterministic scenario are not necessarily the best option once stochasticity is taken into account.

**Figure 7.** Set of non-dominated solutions of problem instances *b*17 (**a**), *b*37 (**b**) and *b*57 (**c**). (**a**) Non-dominated solutions for problem instance *b*17 (tight inventory). (**b**) Non-dominated solutions for problem instance *b*37 (relaxed inventory). (**c**) Non-dominated solutions for problem instance *b*57 (abundant inventory).

#### **6. Conclusions and Future Work**

With the emergence of online retail channels and the popularization of mobile devices, new retailing modes have become popular. Some of these retailing practices allow customers to browse through different online stores and then have the purchased items delivered directly to their homes. Hence, new versions of the vehicle routing problem (VRP), considering additional decision variables and constraints, have emerged. Omnichannel retailing leads to an integrated problem combining the VRP and the pick-up and delivery problem. In omnichannel distribution systems, a set of retail stores needs to be replenished and, at the same time, products have to be sent from these stores to final customers. The resulting omnichannel VRP consists of two stages: (i) a group of retail stores must be replenished from a distribution center; and (ii) a set of online customers must be served, by the same fleet of cargo vehicles, from these retail stores.

For solving the deterministic OM-VRP, a simple heuristic was initially introduced. This heuristic was then extended into a multi-start biased-randomized algorithm, which was tested against state-of-the-art methodologies. Our biased-randomized algorithm performs well on a set of 60 instances of the deterministic OM-VRP, which allowed us to extend it into a full simheuristic for solving the stochastic version of the problem: a more realistic version of the OM-VRP in which travel times are modeled as random variables following a log-normal distribution. Our simheuristic approach is also capable of measuring the reliability of any proposed solution when it is employed in a stochastic scenario. To the best of our knowledge, this is the first time that such a stochastic variant of the problem has been solved in the literature. Regarding the simheuristic results, we conclude that the best deterministic solutions may perform badly when used in a stochastic scenario, as those solutions are often not reliable in terms of completing all routes within a time limit. On the other hand, our simheuristic approach was able to generate reliable and competitive results for these stochastic scenarios. Therefore, our methodology enables decision-makers to choose the solution that best fits their utility function in terms of cost and reliability.

Two-echelon distribution systems, such as the one considered here, are typically characterized by the use of different fleets of vehicles at each distribution level. In this work, a single fleet of vehicles serves both delivery levels, and the products ordered by online customers were assumed not to affect the vehicles' capacity. Therefore, future research directions include the consideration of non-negligible online customer order sizes, heterogeneous vehicle fleets, and positive demands for multiple product types per customer. Regarding the solution approach, different perturbation stages and local search operators could be tested in order to speed up convergence towards near-optimal solutions, which might be particularly relevant for large-sized instances.

**Author Contributions:** Conceptualization, L.d.C.M., C.B. and A.A.J.; methodology, L.d.C.M., C.B., P.J.C.-M., J.P. and A.A.J.; software, L.d.C.M. and P.J.C.-M.; writing, L.d.C.M., C.B., P.J.C.-M., J.P. and A.A.J.; validation, L.d.C.M., P.J.C.-M., J.P. and A.A.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work has been partially supported by Rhenus Freight Logistics GmbH & Co. KG and by the Spanish Ministry of Science, Innovation, and Universities (PID2019-111100RB-C21, RED2018-102642-T). We also acknowledge the support of the Erasmus+ Program (2019-I-ES01-KA103-062602).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Simheuristics Approaches for Efficient Decision-Making Support in Materials Trading Networks**

**Markus Rabe \*, Majsa Ammouriova, Dominik Schmitt and Felix Dross**

IT in Production and Logistics, Faculty of Mechanical Engineering, TU Dortmund, 44227 Dortmund, Germany; majsa.ammouriova@tu-dortmund.de (M.A.); dominik.schmitt@tu-dortmund.de (D.S.); felix.dross@tu-dortmund.de (F.D.)

**\*** Correspondence: markus.rabe@tu-dortmund.de

**Abstract:** The distribution process in business-to-business materials trading is among the most complex and least transparent ones within logistics. The highly volatile environment requires continuous adaptations by the responsible decision-makers, who face a substantial number of potential improvement actions with conflicting goals, such as simultaneously maintaining a high service level and low costs. Simulation-optimisation approaches, for example based on evolutionary algorithms, have been proposed in this context, but at real-world system dimensions they face impractically long computation times. This paper addresses this challenge in two principal streams. On the one hand, reinforcement learning is investigated to reduce the response time of the system in a concrete decision situation. On the other hand, domain-specific information and the definition of equivalent solutions are exploited to support a metaheuristic algorithm. For these approaches, we have developed suitable implementations and evaluated them with subsets of real-world data. The results demonstrate that reinforcement learning exploits the idle time between decision situations to learn which decisions might be most promising, thus adding computation time but significantly reducing the response time. Using domain-specific information reduces the number of required simulation runs and guides the search for promising actions. In our experimentation, defining equivalent solutions decreased the number of required simulation runs by up to 15%.

#### **1. Introduction**

Managing distribution networks is a challenging task for decision-makers. The specific challenge in this field is the complex structure of a typical network, which is a multi-echelon system with horizontal and vertical shortcuts in combination with huge numbers of nodes as well as transported parts, each kind of which is defined as a stock keeping unit (SKU). In addition, this kind of logistics operates in a highly volatile environment, requiring continuous adaptations by the responsible managers who need to conduct frequent maintenance and improvement decisions, especially on tactical horizons. In practice, decision-makers face a substantial number of potential improvement actions, spanned by the huge number of objects that can and have to be combined for suitable maintenance of the network [1,2]. Furthermore, conflicting goals, such as simultaneously maintaining a high service level and low costs, characterized by specifically defined key performance indicators (KPIs) [3], add to the complexity of the problem. Short delivery times versus low costs as well as a large number of delivered SKUs by their due date versus a large number of fully completed orders are other examples of conflicting goals in the network.

**Citation:** Rabe, M.; Ammouriova, M.; Schmitt, D.; Dross, F. Simheuristics Approaches for Efficient Decision-Making Support in Materials Trading Networks. *Algorithms* **2021**, *14*, 23. https://doi.org/10.3390/a14010023

Received: 14 December 2020; Accepted: 8 January 2021; Published: 14 January 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The first challenge is the specific shape of the problem. Traditional parameter-optimisation tasks, easily found in the broad literature, are described by a given (finite) number of parameters, each with a given number of possible values. The size of the solution space, in this case, is the product of the numbers of possible values of all parameters. However, in the network covered in this paper, there is an (in principle) unlimited number of actions to change the network, each of which could be any of a given limited set of actions, applied to any of the object combinations. Therefore, traditional algorithms are not applicable because of the daunting combinatorics. As an example, a parameter entry of the algorithm traditionally stands for a specific characteristic of the underlying system; here, however, it stands for any kind of action. Consequently, the sequence of the elements is clearly relevant, in contrast to the cases usually found in the state-of-the-art literature.

Due to their complexity, real-world networks do not admit a closed mathematical formulation of the goal functions [4]. Therefore, an application might resort to a three-stage solution procedure. In the first stage, changes to the system are defined that are expected to maintain or even improve the KPIs. In the second stage, a discrete event simulation (DES) model [5] is run to determine basic logistical data, such as throughput times, in-time delivery, or resource utilisation. In the last stage, these data are used, as in the real world, to calculate the KPIs, which then serve to judge the quality of the changes set up in the first stage.

This three-stage process can, of course, be conducted manually by human experts, but there have been a number of attempts to automate the optimisation of distribution networks. Unfortunately, due to the huge solution space, there is again no closed mathematical formulation for finding a minimum or maximum of the desired KPIs, leading to the necessity of applying heuristic algorithms, which do not deliver the optimum but, hopefully, a "suitably good" solution in finite time [6,7]. Problem-specific heuristics would require an extremely high development and implementation effort for such complex applications. Therefore, reported implementations exploit metaheuristics, often biology-inspired, such as evolutionary algorithms [8]. Metaheuristics are optimisation algorithms that are, at least up to a point, independent of the specific application problem, with the great advantage that new developments can be exploited for a multitude of applications; thus, very sophisticated and efficient algorithms are available.

The scientific solution approach applied in this paper is consequently the method of simheuristics [9], which combines DES, used as an evaluation function, with metaheuristics for the optimisation. The major drawback of this approach is the time required to produce promising action proposals. The tremendous time requirements originate from the huge solution space described before, in combination with the significant runtimes of the complex simulation models [10], the latter being in the range of minutes or even hours per single run.

Very different approaches have been proposed to face this problem. Optimising the performance of the metaheuristic itself is hardly promising, as the processing of these algorithms covers only a very minor part of the total run time; by far the major part is consumed by the DES. An obvious idea is to develop advanced simulation models that can evaluate the distribution system faster. Alternatives address the effort required to obtain statistical relevance, for example, by finding better controls for the number of required replications or by adapting such numbers to the degree to which a specific solution seems promising [11–13]. This article explores two different innovative ways.

Actually, the real performance challenge is not to minimise the effective computation time *TC* on the computer resource, but to reduce the time required to present a decision proposal, *TD*, defined as the time span between the availability of the data and the provision of acceptable solution proposals. With metaheuristics such as an evolutionary algorithm (EA) [7], *TD* can only be reduced with respect to *TC* when additional resources are applied. For example, using four computer processors instead of one will cut *TD* to about 25% of *TC*; with ten computers, it will drop to about 10% (assuming that the distribution overhead is negligible). However, the tactical decisions are performed, for example, once per month, with a decision request three days after the availability of field data in the data warehouse. Thus, ten computers might idle for 27 days waiting for the next decision request. Our solution approach is to apply machine learning, where a learning algorithm is operating for 90% of the time, and in the concrete decision situation the acquired knowledge is used to quickly find promising solutions. We consider deep reinforcement learning for this purpose, as it allows us to estimate the impact of specific changes, even if these have never been simulated in the past, by using the "fuzzier" knowledge within the deep learning's neural network [14,15].

The second innovative idea is to feed domain-specific information (DSI) back into our metaheuristic algorithm [16]. We analyse three different approaches to using contextual information to accelerate the heuristic search. The first exploits experience from previous (actual or simulated) applications of actions, following the idea that actions that have proven helpful in the past might lead to good results in the future as well. This method also allows us to exploit the idle time between decision periods to collect experience about the actions' success and, thus, to improve the forecasting power of the success indicators. A second approach uses the classification of actions into those that change the network's structure and those that just change its parameters, assuming that structural changes are more effective if applied among the first actions of the action plan. The last idea assumes that the actions are not independent, so that performing one action will influence the impact of further actions. In this approach, the correlation among actions is computed from past data and projected onto future action plans. Again, idle time can be used to calculate and continuously improve the correlation indicators.
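As an illustration of the first approach, a success-indicator-biased action selection could be sketched as follows; all names are hypothetical, and the authors' system differs in detail:

```python
import random

def pick_action(actions, success_score, rng=None):
    """Roulette-wheel selection biased by success indicators gathered from
    past (actual or simulated) applications of actions: actions that
    improved the KPIs before are proposed more often, while every action
    keeps a base chance of being selected."""
    rng = rng or random.Random()
    # A base weight of 1.0 keeps untried actions selectable.
    weights = [1.0 + max(0.0, success_score.get(a, 0.0)) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Because the success scores can be updated during the idle time between decision periods, the bias improves continuously without adding to the response time *TD*.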

Finally, the number of simulation runs could be reduced, and we discuss two ways towards this goal. On the one hand, different action plans are analysed to determine whether they can be predicted to yield identical results, in which case there is no need to simulate both of them. This leads to a performance improvement without reducing the result quality (or, from a different point of view, to better results within the same time frame). On the other hand, the reduction of the solution space itself, by grouping actions into fewer selectable items, is considered [17]. It can be assumed that this approach, quite common in real-world decision-making, will lead to a significant reduction in simulation runs. However, reducing the search space might exclude some even better solutions, which could nevertheless be acceptable if the method leads to better solutions within a limited time frame.

For all these approaches, we have developed suitable implementations and evaluated them with subsets of real-world data. In this paper, we give an overview of the implementations, present exemplary results, and draw conclusions about the suitability of the investigated approaches. The paper is organised as follows: Section 2 presents the related work and Section 3 the considered optimisation approaches. Section 4 introduces the general ideas and architecture of the developed logistics assistance system and clarifies its relationship to real-world data. Sections 5–7 discuss the above-mentioned approaches one by one, followed by an evaluation of the novel concepts based on real-world data in Section 8. The discussion relates the major findings to the previous state of the art and derives future research paths, followed by a summary of the achievements in Section 9.

#### **2. Related Work**

#### *2.1. Simulation of Distribution Systems*

Management of logistics distribution systems is a complex task; hence, decision-makers use models to represent and study these systems. A model is a "simplified reproduction of a planned or existing system with its processes in a different conceptual or concrete system" [18] (p. 3). In these models, decision variables are under the control of decision-makers and constitute the input to the models [19]. Decision-makers change the values of the decision variables and use the output of the models to study the relationships between the variables and the performance of the modelled system.

Distribution systems exhibit a high degree of interaction between their entities and are characterised by time-dependent variables [18]; thus, mathematical models [5] reach their limits in modelling distribution systems, and simulation is used to model them instead.

Simulation studies are conducted in several phases that are, for example, illustrated in the procedure model presented by Rabe et al. [20]. This model shows the phases targeted to guarantee the building of a representative model of the system under study. It highlights the importance of verification and validation of the model as well as the data in the simulation study [21].

A variety of model representation techniques based on simulation modelling have been developed, such as DES [22]. Discrete event simulation models a system by focusing on its discrete states. Fanti et al. [23] claimed that DES is the most preferred simulation approach for logistics systems. Pujawan et al. [24] modelled a cement distribution system using DES to estimate costs and service level. In another study, Fang and Li [25] evaluated various inventory scenarios using DES. Ivanov [26] studied the effect of a disruption at one point in a network using DES, employing it to perform a sensitivity analysis.

#### *2.2. Optimisation of Distribution Systems*

In optimisation problems, decision variables are optimised. These variables might have continuous values or countable values [7]. Optimisation problems with countable decision variables are integer programming problems, sometimes called combinatorial optimisation problems [27]. Such problems are based on combinatorics [28], in which a solution is formed from a finite space of elements to optimise an objective function [29]. Optimisation of distribution systems could be formulated as a combinatorial optimisation problem [30], such as the travelling salesman problem, in which the task is to arrange a tour through a number of cities to minimise the total travelled distance.
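As a toy illustration of such a finite solution space, the following sketch enumerates all tours of a small travelling salesman instance; the distance matrix and city count are invented purely for illustration.

```python
from itertools import permutations

# Hypothetical symmetric distance matrix for 4 cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

def tour_length(tour):
    """Total length of a closed tour visiting every city exactly once."""
    return sum(DIST[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def brute_force_tsp(n):
    """Enumerate the finite space of solutions: all tours starting at city 0."""
    rest = min(permutations(range(1, n)),
               key=lambda perm: tour_length((0,) + perm))
    best = (0,) + rest
    return best, tour_length(best)

tour, length = brute_force_tsp(4)
```

Enumeration like this only works for tiny instances; the factorial growth of the solution space is precisely why the approximate methods discussed next are needed.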

Greedy algorithms have been used to find approximately optimal solutions for simple combinatorial optimisation problems [31]. Adding constraints to these problems or increasing their size increases the difficulty of solving them, and the problems might become NP-hard [4]. The travelling salesman problem is an example of an NP-hard problem that is solved by approximate methods (such as metaheuristics) to find promising feasible solutions [31]. "A metaheuristic is a high-level problem-independent algorithmic framework that provides a set of guidelines or strategies to develop heuristic optimisation algorithms" [32] (p. 960).

Researchers have used metaheuristics to solve a variety of combinatorial optimisation problems (e.g., Osaba et al. [33]). Cybulski [34] found that evolutionary algorithms exhibited promising performance on different benchmark combinatorial optimisation problems. Other researchers have combined metaheuristics with further methods in hybrid approaches [7], such as the simheuristics approach [35].

#### *2.3. Simheuristics*

For the optimisation of complex systems, simulation and optimisation methods are combined in simulation-optimisation methods [36], which depend on the optimisation algorithm that is used *and* the simulation purpose. The VDI-Guideline 3633.12 [37] classifies the relation between simulation and optimisation algorithms into four categories: "Category A" in which simulation follows optimisation, "Category B" with optimisation following the simulation, "Category C" integrating an optimisation algorithm in the simulation, and "Category D" in which simulation is integrated in an optimisation algorithm.

A variety of optimisation algorithms could be combined with simulation, for example, metaheuristics. The integration of simulation in metaheuristics is called simheuristics [35] and is classified as "Category D" by VDI-Guideline 3633.12 [37]. Simheuristics combines the power of metaheuristics and simulation. The metaheuristic algorithm forms solutions that are evaluated by simulation with respect to the objective function in the optimisation problem.
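A minimal sketch of this "Category D" idea follows: a simulation-based estimate of a stochastic objective is embedded inside a simple metaheuristic loop. The toy objective, the noise model, and all parameter values are assumptions for illustration only, not the implementation used in the LAS.

```python
import random

random.seed(42)

def simulate(solution, replications=30):
    """Monte Carlo evaluation: estimate the expected cost of a solution
    under a stochastic toy cost model (deterministic part plus noise)."""
    base = sum((x - 3) ** 2 for x in solution)   # deterministic cost component
    samples = [base + random.gauss(0, 1) for _ in range(replications)]
    return sum(samples) / len(samples)

def simheuristic(dim=4, iterations=200):
    """Category D: the simulation is embedded in the metaheuristic. A simple
    local-search metaheuristic proposes candidate solutions; the simulation
    supplies the (estimated) objective value used to accept or reject them."""
    best = [random.uniform(0, 6) for _ in range(dim)]
    best_cost = simulate(best)
    for _ in range(iterations):
        candidate = [x + random.gauss(0, 0.5) for x in best]
        cost = simulate(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

solution, cost = simheuristic()
```

The key structural point is that every objective evaluation is a simulation run, which is also why the number of simulation runs dominates the optimisation run time discussed later in this paper.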

Researchers have used simheuristics to solve a variety of combinatorial optimisation problems. For example, Juan and Rabe [35] proposed an approach that outperformed a traditional method to solve an inventory routing problem. They combined Monte Carlo simulation and the best-known heuristic that solves the problem. Juan et al. [9] proposed a framework to handle a stochastic type of combinatorial optimisation problems with moderate volatility. Jackson et al. [38] combined DES and a genetic algorithm to handle the stochasticity in an inventory management problem. Discrete event simulation and a genetic algorithm were used to design a supply network by Gutenschwager et al. [39]. Pages-Bernaus et al. [40] investigated a facility location problem. They found that a simheuristic approach outperformed other stochastic programming methods to solve the problem.

#### *2.4. Literature Summary and Contributions*

Distribution networks are complex networks with a large number of possible actions that might be used to improve them. These networks are difficult to model adequately in a mathematical formulation, and, thus, simulation is often used for their study. In order to optimise such networks with their conflicting goals, metaheuristic algorithms could be used to provide a "good solution". Because these networks are complex, researchers have found it beneficial to combine simulation and metaheuristics in a simheuristic approach to optimise them. This approach forms the basis for a logistics assistance system (LAS) that is described in Section 4. In the LAS, actions are selected to construct action plans in which the order of the actions is significant. However, the LAS has a long decision proposal time, because simulation forms the major part of the optimisation run time.

Researchers have suggested approaches to reduce the number of evaluations in an optimisation algorithm, such as screening solutions [41]. Other researchers have proposed a biased random selection of a solution's elements to improve the optimisation [42]. None of these strategies addresses the presented action plan problem, a combinatorial optimisation problem in which actions are selected.

Since decision-makers look for a "good solution" within a reasonable decision proposal time, even in large and complex networks, our paper proposes innovative approaches to reduce this time. In the first approach, reinforcement learning is proposed to reduce the decision proposal time; Sections 3.2 and 5 describe reinforcement learning and its innovative implementation in the LAS. Another innovative approach defines a network's domain-specific information and utilises it to recommend actions (Section 6). The last proposed approaches reduce the number of simulation runs by defining equivalent solutions and by reducing the actions' search space (Section 7).

#### **3. Optimisation Approaches**

#### *3.1. Evolutionary Algorithms*

Evolutionary algorithms (EAs) are metaheuristic algorithms developed to solve optimisation problems. In EAs, solutions are represented as individuals in a population. The individual's definition and encoding depend on the optimisation problem to be solved [43]; an individual might, for example, be represented as a string of binary numbers or real numbers. A fitness value assigned to each individual represents the objective function value associated with it. These individuals evolve in each generation and form a new population. Inspired by Darwin's theory of evolution, the fittest individuals are the ones that evolve. The selection of the individuals to evolve is facilitated by a biased selection mechanism, such as roulette-wheel or tournament selection [44,45].

In addition to the terms "individual", "population", and "fitness", the EA literature uses terms such as "offspring", "crossover", and "mutation" [7,43,45]. Offspring are individuals reproduced from the population using crossover and mutation. In a crossover, parts of the genes are exchanged between two individuals. The crossover form depends on the number of exchange points along the individual's length, for example, one-point crossover, two-point crossover, and uniform crossover; in uniform crossover, multiple crossing points are defined along the individual's length. Mutation modifies a single individual to reproduce an offspring: one or more parts of the individual are changed. Both crossover and mutation can be customised based on the optimisation problem.

The evolution of the individuals continues until a termination criterion is met, such as reaching a specified number of generations or stagnation [7,45]. Stagnation occurs when the best solution found by the algorithm has not changed over a specific number of generations.
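The EA building blocks described above (biased tournament selection, one-point crossover, mutation, and termination by stagnation or a generation limit) can be sketched for the classic OneMax toy problem; all parameter values below are illustrative assumptions.

```python
import random

random.seed(1)

def fitness(ind):
    """OneMax: maximise the number of ones in a binary string."""
    return sum(ind)

def tournament(pop, k=3):
    """Biased selection: the fittest of k randomly drawn individuals wins."""
    return max(random.sample(pop, k), key=fitness)

def one_point_crossover(p1, p2):
    """Exchange genes at a single crossing point."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):
    """Flip each gene with a small probability."""
    return [1 - g if random.random() < rate else g for g in ind]

def evolve(n=30, length=20, max_gen=200, stagnation_limit=25):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    best, stagnant = max(pop, key=fitness), 0
    for _ in range(max_gen):
        pop = [mutate(one_point_crossover(tournament(pop), tournament(pop)))
               for _ in range(n)]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best, stagnant = gen_best, 0
        else:
            stagnant += 1                    # termination by stagnation
        if stagnant >= stagnation_limit:
            break
    return best

best = evolve()
```

In the LAS setting, an individual would encode an action plan and the fitness evaluation would be a simulation run rather than a cheap closed-form function.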

Evolutionary algorithms can handle optimisation problems in different domains. For example, an EA was used to minimise the service costs in marine container terminal operations [46]. Pasha et al. [47] compared an EA with other algorithms for solving the vehicle routing problem with a "factory-in-a-box" concept; the EA found the solution closest to the optimum in small-scale problem instances and better solutions than the other algorithms in large-scale problem instances. Evolutionary algorithms can detect promising regions in the search space and can be utilised in learning processes while solving optimisation problems [48,49]. In online optimisation, the problems' features change over time and are unknown in advance; a learning-based EA was proposed to handle this setting [49]. Additionally, EAs have been used to solve combinatorial optimisation problems such as the vehicle routing problem and the travelling salesman problem [7,48].

Evolutionary algorithms can also handle multi-objective optimisation problems [50]. In these problems, the non-dominated solutions, i.e., those that are not outperformed by any other solution in all objective functions simultaneously, are stored, and the algorithm searches for solutions along the Pareto front. Moradi [48] used a multi-objective evolutionary algorithm to minimise the number of vehicles and the total travelled distance in the vehicle routing problem. Researchers have also utilised EAs in optimisation problems that include the minimisation of energy consumption. For example, Ji et al. [51] solved a multi-objective green express cabinet assignment problem in urban last-mile logistics using a probability-guided EA, minimising total costs and energy consumption. The total makespan and total energy consumption were minimised in the flow shop scheduling problem with sequence-dependent setup times [52]. Petrowski and Ben-Hamida [43] stated that EAs for multi-objective problems provide promising solutions for problems with fewer than four objective functions. Other approaches combine an EA and DES to solve optimisation problems, such as optimising lead time and total inventory holding costs in job sequencing and buffer size optimisation problems [53].
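The notion of non-dominated solutions can be made concrete with a small sketch; minimisation of both objectives is assumed, and the (vehicles, distance) pairs are invented for illustration.

```python
def dominates(a, b):
    """a dominates b if a is no worse in all objectives (minimisation)
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(solutions):
    """Keep only the solutions that no other solution dominates
    (an approximation of the Pareto front)."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical (number of vehicles, total distance) pairs for a bi-objective VRP.
front = non_dominated([(3, 120), (4, 100), (3, 110), (5, 100), (4, 95)])
```

A multi-objective EA maintains exactly such an archive of non-dominated individuals across generations.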

#### *3.2. Deep Reinforcement Learning*

In the field of reinforcement learning, the primary goal is to produce autonomous agents that interact with their environments to learn optimal behaviours, improving over time through trial and error. Although considerable successes in this field have been reported in the past [14,54,55], previous approaches often lacked scalability and were inherently limited to fairly low-dimensional problems [56]. In order to apply reinforcement learning to problems approaching real-world complexity, agents need to be able to derive efficient representations of the environment from high-dimensional inputs, and use these representations to generalise past experiences to new situations [15]. In recent years, improvements in the ability to process large amounts of data have led to considerable progress in this field [57]. A major reason for this development is the rise of deep learning, a class of representation learning with multi-layer artificial neural networks [58–60]. These neural networks are called deep because they have numerous hidden layers between the input layer and the output layer and, thus, an extensive internal structure [61–63]. By combining several layers, a deep neural network can find compact low-dimensional representations in high-dimensional data, for example, in audio, text, and images. Deep neural networks can, hence, be used to progressively build up abstract representations of data and, thus, enable abstract learning.

A particularly successful type of deep neural network is the Convolutional Neural Network (CNN) [15,64]. CNNs leverage the fact that the analysed data are composed of smaller details, referred to as features, and trigger a decision about the entire data set by analysing each feature in isolation. By using the mathematical concept of convolution, CNNs are able to learn patterns, for example, to associate object categories with raw image data. Applied in the context of reinforcement learning, a CNN can be used to approximate the internal value function of the agent and, thus, to map actions to constellations of data in a particular state [65]. In this regard, effective progress has been made in addressing the curse of dimensionality in the field of reinforcement learning [15,56]. Deep reinforcement learning is expected to be able to scale to decision-making problems that were previously unsolvable, for example, settings with high-dimensional state and action spaces. Recent work in this field has shown outstanding progress, which started with an algorithm that could learn to play a range of Atari 2600 video games at super-human level, directly from screen pixels [65]. A comprehensive survey of efforts in deep reinforcement learning can be found in Reference [66].

#### **4. Solution Architecture**

#### *4.1. A Logistics Assistance System*

A logistics assistance system has been developed, using the Python programming language, to assist decision-makers in distribution networks [16]. This LAS utilises basic components related to a transactional system in a distribution network, such as Enterprise Resource Planning. Figure 1 shows the architecture of the LAS. The data are extracted from the transactional system and loaded into a data warehouse. KPIs are calculated, and any deterioration of a KPI's value beyond its previously assigned limit triggers an alert in the corresponding key performance indicator management system (KPIMS) [67]. These systems recommend potential actions that are designed to improve the value of the KPI. An action recommended by a KPIMS is expected to improve the value of the KPI that triggered the alert, but it could worsen the values of other KPIs. Since each KPIMS recommends actions independently from the others, the LAS aims to consider the impact of the actions on the entire network and to improve the network's performance as a whole.

**Figure 1.** The architecture of the developed Logistics Assistance System (LAS) [68].

In addition, the model builder extracts data from the transactional system into a data model that forms the database of a data-driven simulation tool (Figure 1). Rabe and Dross (2015) proposed using data-driven simulation tools in which actions are applied through SQL statements, such as SimChain, a generic supply chain simulation model based on the Siemens Plant Simulation tool [69]. The developers of the LAS have previously used SimChain to model a similar distribution network [70].
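A minimal sketch of how an action could be executed as an SQL statement against a data-driven simulation database follows. The in-memory SQLite database and the simplified `stock` table are assumptions for illustration; the actual SimChain data model differs.

```python
import sqlite3

# Hypothetical, simplified stock table (the real SimChain schema differs).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stock (sku TEXT, site TEXT, safety_stock INTEGER)")
con.executemany("INSERT INTO stock VALUES (?, ?, ?)",
                [("SKU1", "A", 100), ("SKU1", "B", 80), ("SKU2", "A", 50)])

def apply_action(con, sku, site, increase):
    """An action is applied as an SQL statement that changes the
    data-driven simulation's database, as described for the LAS."""
    con.execute("UPDATE stock SET safety_stock = safety_stock + ? "
                "WHERE sku = ? AND site = ?", (increase, sku, site))

# e.g. the action 'increase the safety stock of SKU1 at site A by 20':
apply_action(con, "SKU1", "A", 20)
level, = con.execute("SELECT safety_stock FROM stock "
                     "WHERE sku = 'SKU1' AND site = 'A'").fetchone()
```

The point of the data-driven design is that the simulation model itself never changes; only the database rows that parameterise it do.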

In the LAS, actions are derived from action types [71], which are defined by simulation experts. An action type represents a generic description of an action, for example, centralising an SKU in a site without specifying the SKU and the site. To derive actions from the action types, input parameters are specified, for example, SKU 1 in site A. Action types are defined in the action type designer by decision-makers of the LAS using a domain-specific modelling language. The action types are stored in an action type directory that can be accessed by the KPIMS and the heuristic unit in the simheuristic framework (cf. Figure 1). The actions recommended by the KPIMS and the actions derived from action types form the search space of actions for the heuristic unit.

In the heuristic unit, the simheuristic approach is implemented by simulation and a metaheuristic algorithm. The metaheuristic algorithm explores the search space and constructs action plans to be evaluated. The changes applied by actions are executed as SQL statements in the database. These data are then transferred to a "shadowed data warehouse", which mimics the calculation of the operational KPIs, but for simulated data. The shadowed data warehouse is introduced in order to avoid any potential mix-up of the simulated (experimental) data with the real operational data sets. The (simulated) KPIs calculated from the shadowed data warehouse form the objective values to be optimised by the metaheuristic. Construction and evaluation continue until a termination condition is met. Then, the suggested actions are recommended as an action plan to decision-makers, who are the main users of the LAS.

#### *4.2. Semantic Model for Action Types in Wholesale Logistics Networks*

To increase the LAS's flexibility and usability, decision-makers can model and integrate user-generated action types. For this purpose, decision-makers can utilise a specifically developed domain-specific modelling language, which is tailored to the modelling of action types in wholesale logistics networks. The developed domain-specific modelling language is accessed via the Action Type Designer, an integrated development environment (IDE) providing all benefits of common IDEs, such as code completion and syntax highlighting.

All action types are based on the same semantic model. Therefore, the semantic model of an action type must be capable of representing all required information for all possible action types in a logistics network. Action types can be instantiated by adding type-specific parameters to the semantic model's attributes [72].

The attributes of the semantic model serve different purposes and are, thus, divided into different categories. The first category of attributes has informative purposes, such as the action type's *name*, *description*, *id*, the *owner's id*, and a list of ids representing the involved modellers.

Action types represent changes to the underlying simulation model. For the specification of these changes, functional attributes are used. The attribute *input* is used to define the affected entities of the logistics network. To specify the effects on those entities, *statements* of the domain-specific modelling language are used.

Meta-information of action types is stored in meta-attributes of the semantic model. For example, it may take some time to fully execute a set of actions in the real logistics network; therefore, the time required for executing actions is stored in the attribute *time till effect*. Additionally, executing an action may entail costs, which are stored in the attribute *total costs*.

The semantic model additionally includes domain-specific information [68,73,74], for example, the *frequency* or the *impact* of an action type on the logistics network, which can be stored in corresponding attributes of the model. Changes to the network can be categorised into two different groups, *structural* and *parametrical* changes. A structural change alters the structure of the network, for instance, adding new routes, sites, or SKUs to a site. Actions that affect attributes of the logistics network's entities are categorised as parametrical, for example, increasing the safety stock or changing the frequency of a route. In addition, *correlations* between different action types and their actions' impact on the network can be modelled. For example, when "centralising an assortment", "increasing the safety stock" of any centralised SKU might be a promising candidate for further actions. Thus, a positive relation is defined between "centralise" and "increase the safety stock".

After the parameterisation of an action type, the corresponding derived actions can be stored in the semantic model's attribute *actions*. An overview of the semantic model is given in Table 1.


**Table 1.** The semantic model for action types in wholesale logistics networks, based on Reference [72].
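The semantic model described above can be rendered, for instance, as a simple data class. The attribute names paraphrase Section 4.2; the types, defaults, and example values are assumptions for illustration, not the exact model of the LAS.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ActionType:
    """Simplified sketch of the semantic model for action types."""
    # Informative attributes
    name: str
    description: str
    id: int
    owner_id: int
    modeller_ids: List[int] = field(default_factory=list)
    # Functional attributes
    input: List[str] = field(default_factory=list)      # affected entities
    statements: List[str] = field(default_factory=list) # DSL statements
    # Meta-attributes
    time_till_effect: int = 0      # e.g. days until the action takes effect
    total_costs: float = 0.0
    # Domain-specific information
    frequency: int = 0
    impact: str = "parametrical"   # 'structural' or 'parametrical'
    correlations: Dict[str, float] = field(default_factory=dict)
    # Derived actions after parameterisation
    actions: List[dict] = field(default_factory=list)

centralise = ActionType(
    name="centralise",
    description="Centralise an SKU at a site",
    id=1, owner_id=7,
    impact="structural",
    correlations={"increase_safety_stock": 1.0},  # positive relation
)
```

Instantiating an action then amounts to appending a parameterised entry (e.g. SKU and site) to the `actions` attribute.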

#### *4.3. Abstracting the Modelling of Action Types from the Underlying Simulation's Data Base*

Actions are closely related to the underlying simulation's data model, resulting in multiple issues. Modelling action types requires in-depth knowledge of the database's structure, for example, for specifying the areas of the database that are affected by applying corresponding actions. Another issue arises when the database's structure changes, for example, when the simulation software is updated or a new simulation tool is introduced. To address these problems, the authors propose to decouple the modelling of action types from the underlying simulation data model [75].

When applying an action, a set of corresponding entities of the logistics network needs to be adapted accordingly. In a data-driven simulation, an entity can be described by entries in a database table. To identify the correct entities, the table's name and the entities' attributes are defined as part of the statements in the modelling process of an action type (Section 4.2). Thus, when applying an action, multiple entries in the database might be changed (Figure 2).

However, the simulation's database and, therefore, the data model, are typically predefined by the simulation tool that is being used. Thus, the data model is not easily adjustable. When decoupling the modelling of action types from the data model of the underlying simulation, an additional data model is required: an enterprise-specific data model. This enterprise-specific data model can be structured in the way that best suits the decision-makers' knowledge and needs.

Using an enterprise-specific data model, action types can be specified against this model instead of the simulation data model. To correctly convert actions into changes to the simulation data model, each attribute of the simulation data model must be distinctly linked to the corresponding attribute of the enterprise-specific data model (Figure 3). Such a mapping can be defined, for example, in the form of a JSON file. In the process of executing an action, the mapping information between the two models is read from the JSON file, so that changes can be applied to the simulation data model accordingly [75].

**Figure 2.** Applying an action directly to the simulation's database.

**Figure 3.** Applying an action to the enterprise-specific data model and mapping the resulting changes to the simulation data model.

By utilizing this approach, action types can be modelled against an enterprise-specific data model, which can be adapted to the modellers' knowledge and needs. This approach allows decision-makers to become modellers and to specify action types against a common data structure and with known names of the entities and their attributes. Another advantage of this approach is that when the simulation data model changes, only the mapping must be adjusted and not all the definitions of the action types. This saves resources, reduces the risk of faults, and improves the acceptance of the method in general.
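The mapping idea described above can be sketched as follows. The JSON content, the attribute names on both sides, and the translation function are all hypothetical; the paper only specifies that such a JSON-based mapping exists.

```python
import json

# Hypothetical mapping file: enterprise-specific attribute names on the
# left, simulation data model attribute names on the right.
MAPPING_JSON = """
{
  "warehouse.safety_stock": "t_stock.min_level",
  "warehouse.name":         "t_site.site_id"
}
"""

def to_simulation_change(enterprise_change, mapping):
    """Translate a change expressed against the enterprise-specific data
    model into the equivalent change on the simulation data model."""
    attribute, value = enterprise_change
    return mapping[attribute], value

mapping = json.loads(MAPPING_JSON)
# An action raises the safety stock to 120, phrased in enterprise terms:
sim_change = to_simulation_change(("warehouse.safety_stock", 120), mapping)
```

When the simulation tool (and hence its data model) changes, only `MAPPING_JSON` needs updating, while all action type definitions remain untouched.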

#### **5. Addressing the Performance Challenge: Deep Reinforcement Learning**

As mentioned above, creating an algorithm that is able to master a varied range of challenging tasks by teaching itself in a trial-and-error procedure is one of the major goals of reinforcement learning. Reinforcement learning in general considers tasks in which an agent interacts with an environment through a sequence of observations, actions, and rewards. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a challenge: they have to derive efficient representations of the environment from high-dimensional sensory inputs and use these to generalize past experiences to new situations (cf. Section 3.2). The authors of Reference [15] approached this challenge by creating an algorithm that used a Deep Q-Network (DQN) as the central value function approximation.

Almost all reinforcement learning algorithms are based on estimating value functions. The algorithm used for the DQN agent, Q-learning, is based on the Q-function, a function of state-action pairs that expresses how beneficial it is for the agent to perform a given action in a given state with respect to the expected return [76]. More formally, the Q-function expresses the value of taking an action *a* in a state *s* under a policy *π* with respect to the expected return starting from *s*, taking the action *a*, and thereafter following policy *π*. For small problems, the Q-function can be stored as a table, but for larger problems, this table quickly becomes too large to be stored in memory; moreover, the time and data required to fill the table accurately would be prohibitive. In many tasks to which one would like to apply reinforcement learning, most states encountered will never have been experienced in exactly the same way before. The only way to learn anything at all in these cases is to generalise from previously experienced states to ones that have never been seen before. Hence, for larger problems, some form of function approximation is required. In this regard, advances in deep neural networks have made it possible to approximate the Q-function, for example, with a CNN.
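To make the tabular case concrete, a single Q-learning update step can be sketched as follows; the state/action encoding, the learning rate, and the discount factor are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) towards the observed reward plus
    the discounted value of the best action in the successor state."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)            # the tabular Q-function, initialised to 0
actions = ["left", "right"]
q_update(Q, s=0, a="right", reward=1.0, s_next=1, actions=actions)
```

The DQN replaces the table `Q` with a neural network that maps a state representation to Q-values for all actions, which is what makes high-dimensional inputs such as images tractable.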

Mnih et al. [15] tested an agent with a DQN on the Arcade Learning Environment (ALE), which is a software framework designed to simplify the development of agents that play arbitrary Atari 2600 games and, therefore, offers a method to evaluate the development of general, domain-independent reinforcement learning agents [77]. Its source code is publicly available on the Internet [78]. Through the ALE, researchers have access to several dozen games through a single common interface. Eighteen actions can be input to the game via a digital joystick: three positions of the joystick for each axis, plus a single button. The DQN approach of Reference [15] outperformed the best existing reinforcement learning methods on 43 of the 49 games without incorporating any of the additional prior knowledge about the Atari 2600 games used by other approaches. In conclusion, the DQN algorithm trained itself and reached super-human performance just by using the game pixels as the observation and the game score as the reward signal from the environment.

The work by Mnih et al. [15] inspired the authors to test the DQN agent as a reinforcement learning approach to the performance challenge, as discussed earlier in this article (cf. Section 1). For the experiments, the general working principles of the DQN agent have been retained, but the parameters of this agent have been slightly adjusted. The implementation of the DQN has been built with the Python API for TensorFlow, an open-source software library for numerical computations using data flow graphs [79]. After the reinforcement learning agent applies an action to the database, the simulation model is instantiated and the simulation is run as described above. A reward calculation function generates a scalar reward signal from the simulation output data, using the changes in costs and performance. The reward signal is then routed back to the DQN for training [80]. In order to express the state *s* of the logistics system configuration as an image, a feature extraction function selects the different features from the tables in the MySQL database and composes them into an image (Figure 4).

**Figure 4.** Graphical state representation (small image) and details of two exemplary image segments (large images) [80].

The image has been designed to look similar to an Atari game screen in order to make the problem accessible for the DQN agent. The general idea behind the design of the state representation as an image is to profit from the research regarding further domain-independent agents in the future.

Since the agent needs to learn a mapping from states to actions, the state representation also needs to encode information that enables conclusions to be drawn from states about actions. For instance, if the agent is intended to learn to make decisions regarding the inventory, useful information for such decisions, for example, inventory levels and customer demands, needs to be included in the state representation. For actions regarding, for example, machines, other information is needed in the state representation. Thus, the information needed in the state representation heavily depends on the action types available in the system (see Section 4.2). Consequently, the features that have to be selected from the MySQL database for the state *s* are derived from the action types used. In order to address the requirements regarding the scalability of the state representation, the state image is built from different image segments. Each segment corresponds to a segment type, similar to the previously explained relationship between actions and action types. The size of each segment is fixed; the actual segment size is derived from the size of the largest segment type. The overall design was chosen to enable the CNN to more easily identify patterns within the state data.

In order to generate a scalar reward signal for the reinforcement learning agent, a decrease in the total costs after taking an action is translated into a positive reward; an increase in the total costs, on the other hand, is translated into a negative reward. Furthermore, besides the bare costs, the difference in the logistics performance before and after applying the action is also incorporated into the reward. The authors have decided to define a penalty cost that is multiplied by the percentage change in the service level: if the service level decreases, a penalty is generated; if the service level increases, a bonus payment is generated. These service level costs are meant to express the loss of future customer orders due to unsatisfied customers, or the increase in customer orders from satisfied customers. The difference in the logistics costs and the service level costs are summed up and interpreted as the total costs caused by an action. Finally, these total costs are scaled down to generate the final reward signal, which is sent back to the agent. The scaling is done to stay as close as possible to the architecture used in the original DQN implementation.
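The reward construction described above can be sketched as follows. The function signature, the penalty cost per percentage point, and the scaling factor are assumptions for illustration, not the authors' exact implementation.

```python
def reward_signal(costs_before, costs_after, service_before, service_after,
                  penalty_per_percent=1000.0, scale=1e-4):
    """Sketch of the described reward: cost savings yield a positive reward,
    service-level changes are converted into penalty/bonus costs, and the
    resulting total is scaled down before being sent to the agent."""
    cost_delta = costs_after - costs_before
    service_change_pct = (service_after - service_before) * 100.0
    # A service-level drop generates a penalty, a rise generates a bonus:
    service_costs = -penalty_per_percent * service_change_pct
    total_costs = cost_delta + service_costs
    return -total_costs * scale   # lower total costs -> positive reward

# Costs fall by 5000 and the service level rises by 2 percentage points:
r = reward_signal(100_000, 95_000, 0.95, 0.97)
```

Both effects point in the same direction here, so the reward is positive; an action that saved costs by degrading the service level could still receive a negative reward.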

#### **6. Addressing the Performance Challenge: Exploiting Domain-Specific Information**

The performance of the LAS can be evaluated based on the impact of the recommended action plans on the distribution network and on the number of simulation runs. This section focuses on guiding the metaheuristic algorithm in the heuristic unit to find promising actions to be added to the action plan. The approach investigates the potential actions and exploits information to guide the search. This information is called domain-specific information (DSI) and is added to the action type definitions, covering, for example, the type of changes, success, and correlation (Section 4.2) [68,73,74].

#### *6.1. Utilizing the Characteristics of Action Type Classes*

The first illustrated DSI concerns the type of changes applied by an action. An action can change a parametrical value of an entity in a distribution network, for example, the stock level of an SKU at a site. Actions of this kind are called parametrical actions. For example, an action that increases the stock level of SKU 1 at site A is a parametrical action.

Other actions might add an entity to the network or delete one. Such actions cause structural changes in the network, for example, centralising SKU 1 at site A, and typically entail parametrical changes as well. Centralising SKU 1 at site A is realised by several changes, for example, adding SKU 1 to site A if it is not there yet, removing SKU 1 from the other sites, establishing transport relations between site A and the other sites if they do not currently exist, and specifying the parametrical values of SKU 1 at site A. This action must therefore define the parametrical values for the newly added entities, for example, SKU 1 at site A and the transport relations between site A and the other sites.

Accordingly, actions are classified as structural or parametrical based on their type of changes. Structural actions cause significant changes in the network. Their changes can erase the impact of previously applied parametrical actions if they affect the same entity in the network. For example, suppose that action *a*<sub>1</sub> increases the stock level of SKU 1 at site B, and *a*<sub>2</sub> centralises SKU 1 at site A. Action *a*<sub>2</sub> removes SKU 1 at site B from the network when it is applied. Hence, if *a*<sub>2</sub> follows *a*<sub>1</sub>, the change in the stock level caused by *a*<sub>1</sub> is undone by the removal of the entity by *a*<sub>2</sub>.

In order to consider the structural and parametrical changes in the selection of actions, the selection probability of an action is biased according to its type of changes and its position in an action plan. Structural actions are preferred at the beginning of an action plan and parametrical actions at its end. Figure 5 shows a proposed probability distribution of the actions' types of changes that changes linearly along the length of an action plan, *l*. In Equations (1) and (2), *prob*<sup>S</sup><sub>i</sub> is the selection probability of structural actions at position *i* of an action plan, and *prob*<sup>P</sup><sub>i</sub> that of parametrical actions. If structural changes are selected, a structural action is selected from the potential structural actions.

$$prob\_i^S = prob\_1^S - \frac{prob\_1^S - prob\_1^P}{l - 1}(i - 1), \quad i = 1, 2, \dots, l \tag{1}$$

$$prob\_i^P = 1 - prob\_i^S.\tag{2}$$
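As a concrete illustration, the position-dependent choice between structural and parametrical actions can be sketched as follows. This is a minimal sketch in Python, not the paper's implementation; the function names, the action pools, and the starting probabilities of 0.8 and 0.2 are our own assumptions.

```python
import random

def structural_probability(i, length, prob_s_1, prob_p_1):
    """Selection probability of a structural action at position i (1-based)
    of an action plan of the given length: it decreases linearly from
    prob_s_1 at the first position to prob_p_1 at the last one."""
    return prob_s_1 - (prob_s_1 - prob_p_1) / (length - 1) * (i - 1)

def select_action(i, length, structural_pool, parametrical_pool,
                  prob_s_1=0.8, prob_p_1=0.2, rng=random):
    """First draw the type of changes for position i, then draw an action
    uniformly from the pool of that type."""
    p_structural = structural_probability(i, length, prob_s_1, prob_p_1)
    pool = structural_pool if rng.random() < p_structural else parametrical_pool
    return rng.choice(pool)
```

With these starting probabilities, a structural action is four times more likely than a parametrical one at the first position, and the preference is reversed at the last position.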

**Figure 5.** Probability distribution of actions' types of changes along the length of an action plan.

#### *6.2. Building on Success Experience*

Another DSI is the success of actions in improving the performance of a distribution network. Success is defined with respect to an action: the impact of an action on a network is investigated to determine whether it increases or decreases the performance measures. For example, Figure 6 shows actions *a*1 and *a*2, which increase the service level, and an action *a*4, which reduces it. Actions that increase the service level get a higher success value than actions that decrease it. Among the actions that increase the service level, the magnitude of their impact varies, and accordingly, so does their success value, as for *a*1 and *a*2 in Figure 6.

**Figure 6.** Impact of actions on the service level.

Actions with a high success value tend to increase the performance of the network, and hence, increasing their selection probability might guide the metaheuristic algorithm to construct promising action plans. Actions with a high success value get a higher selection probability than actions with a lower success value. For example, the selection probabilities of the actions in Figure 6 might be assigned as 0.32, 0.28, 0.24, and 0.16 to actions *a*1, *a*2, *a*3, and *a*4, respectively.

The success value of a concrete action can only be measured for that single, fully parametrized action. Such values cannot be expected to contribute to the performance improvement directly, because the probability is low that the specific current action has already been evaluated before and thus owns a success attribute. Therefore, success is exploited indirectly: the measured success values of actions are aggregated per action type, and this success attribute of the action types is then used to steer the construction of action plans. With this procedure, action types are preferred if their associated actions have often led to improvements in the past.
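The aggregation of success values at the action-type level might be implemented along the following lines. This is only a sketch under our own assumptions: the class and method names are hypothetical, and the floor probability that keeps unsuccessful types selectable is our own design choice.

```python
from collections import defaultdict

class SuccessTracker:
    """Aggregates measured success values per action type, not per concrete
    action, and derives selection probabilities from the averages."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def record(self, action_type, performance_delta):
        # performance_delta > 0 means the applied action improved,
        # e.g., the service level; < 0 means it worsened it.
        self.totals[action_type] += performance_delta
        self.counts[action_type] += 1

    def selection_probabilities(self, action_types, floor=0.05):
        """Average success per type, shifted to be non-negative and
        normalised; the floor keeps every type selectable."""
        averages = [self.totals[t] / self.counts[t] if self.counts[t] else 0.0
                    for t in action_types]
        shifted = [a - min(averages) + floor for a in averages]
        total = sum(shifted)
        return {t: s / total for t, s in zip(action_types, shifted)}
```

The floor plays the same role as the non-zero probability assigned to the worst action *a*4 in the example above: even unsuccessful action types remain reachable for the search.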

#### *6.3. Defining Relations between Actions*

The last DSI that we propose is the correlation between two actions and their impact on the performance of the network. In this context, the joint impact of a sequence of two actions on the performance of a network is compared to their expected impact if they are applied individually. In Figure 7, the expected impact of applying actions *a*1 and *a*2 is an increase of 0.19% in the service level, given by the sum of the impacts of the single actions *a*1 (0.12%) and *a*2 (0.07%). In fact, the sequence *a*1 followed by *a*2 increases the service level by only 0.15%. Thus, applying these two actions as [*a*1, *a*2] has a negative influence ("−"). The sequence [*a*2, *a*1] increases the service level by 0.28% and thus has a positive impact ("+") compared to the expected impact of the single actions. Another example of a positive impact is the sequence [*a*1, *a*3]. If the expected impact of a sequence does not differ from the actual impact, the relation is weak ("∼"), as for sequence [*a*2, *a*3] in Figure 7.

The correlation between actions could be tabulated in a correlation matrix. Rows represent the first applied action in the sequence, and columns represent the second applied action. The cells show the relation between the impact of a sequence of actions and their expected impact on the network as "+", "∼", and "−" for positive, weak, and negative relations, respectively. The relations extracted from Figure 7 are tabulated in Table 2. The rest of the correlation matrix shows relations between other actions in the search space.

**Table 2.** Correlation matrix example.


The cell between *a*1 and *a*1 represents the duplication of an action. An action is duplicated by applying it twice. If an action increases the stock level of an SKU at a site by 10 units, its duplication increases the stock level by 20 units. Accordingly, the performance of the network is affected and might deviate from the expected impact.

As in the previous section, applying actions affects the performance of the network, and their impact might be used to guide the selection of actions to construct promising action plans. The selection probability of an action is increased if it is in a positive relation with already selected actions. For example, if action *a*2 has been selected, then *a*1 is the recommended follow-up action for increasing the service level (Table 2): selecting *a*2 again does not increase the service level as expected, and selecting *a*3 does not influence the expected increase. Thus, the selection probability of *a*1 becomes higher than that of *a*3.
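Such a correlation-guided selection can be sketched as a weighted draw over the candidate follow-up actions. The relation symbols follow Table 2, but the numeric bias factors and function name below are our own illustrative choices, not values from the paper.

```python
import random

# Relations of ordered action pairs, as in the Table 2 example:
# '+' positive, '~' weak, '-' negative.
CORRELATION = {("a1", "a2"): "-", ("a2", "a1"): "+",
               ("a1", "a3"): "+", ("a2", "a3"): "~"}

# Assumed bias factors translating a relation into a selection weight.
BIAS = {"+": 3.0, "~": 1.0, "-": 0.5}

def biased_next_action(previous, candidates, rng=random):
    """Draw the next action with a probability proportional to the bias
    factor of the relation (previous, candidate); pairs without a recorded
    relation are treated as weak ('~')."""
    weights = [BIAS[CORRELATION.get((previous, c), "~")] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

After selecting *a*2, this sketch makes *a*1 three times as likely as *a*3, mirroring the preference described above.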

#### **7. Addressing the Performance Challenge: Reducing the Number of Evaluations**

Another performance measure of the LAS is the number of simulation runs, which we define to be the number of objective function evaluations. In order to reduce the number of evaluations of the objective function, a selective evaluation might be performed, or evaluations might be skipped. In this research, the evaluation of the impact of an action plan is skipped if the action plan has previously been evaluated [81].

Previously evaluated action plans might be identical to newly formed action plans, or they can be equivalent to them. Equivalent action plans have an identical impact on the performance of the distribution network, but are not identical action plans. Figure 8 shows an example of equivalent action plans. These action plans cause the same changes in the network and, hence, result in the same impact on the performance. The performance of an equivalent action plan can directly be used, without again evaluating the performance of the newly formed plan.

**Figure 8.** Equivalent action plans: (**a**) action plans with interchangeable actions and (**b**) action plans with redundant actions.

In order to identify equivalent action plans, we define interchangeable and redundant actions. Interchangeable actions can be reordered in an action plan without affecting the impact of the action plan on the performance of the network (Figure 8a). Actions *a*1 and *a*2 in Figure 8a have been reordered without affecting the applied changes or the impact of the action plan on the service level; *a*1 and *a*2 are interchangeable actions. Actions that affect different entities in the network can be reordered without affecting the impact of the action plan. These actions are interchangeable as long as they do not interfere with an overall parameter of the network, such as the capacity of a site. For example, *a*1 increases the stock level of SKU 1 at site A, and *a*2 increases the stock level of SKU 2 at site A. The two actions affect different entities in the network, SKU 1 at site A and SKU 2 at site A, so applying the changes of either action does not affect the changes applied by the other. However, if the total capacity of site A reaches its limit after applying either of these actions, the resulting applied changes differ when the actions are swapped.

Redundant actions can be removed from an action plan without affecting the impact of the action plan on the distribution network (Figure 8b). Action *a*5 has been removed from the action plan in Figure 8b. The resulting action plan applies the same changes as the original action plan containing *a*5, and the service level remains unchanged. Action *a*5 is a redundant action, because its removal has no effect on the changes applied in the network. Such actions are typically duplicated structural actions: a structural action adds or removes entities from the network, and its duplication repeats these changes without causing additional changes. A duplicated parametrical action that causes incremental changes is not redundant, because it causes a further incremental increase or decrease in the value of the affected parameter.

Defining equivalent action plans based on interchangeable and redundant actions enables skipping the evaluation of an action plan: an action plan is assigned the rating of an equivalent action plan that was evaluated previously.
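One way to skip such evaluations is to map every plan to a canonical key before looking it up in a cache of evaluated plans. The sketch below is our own simplification: it assumes an action is fully described by its name, the entity it affects, and whether it is structural, and it ignores the capacity caveat discussed above (actions touching a shared overall parameter would need their order preserved).

```python
from collections import namedtuple

# Hypothetical action representation, not taken from the paper.
Action = namedtuple("Action", ["name", "entity", "structural"])

def canonical_form(plan):
    """Map equivalent action plans to the same key: duplicated structural
    actions are dropped as redundant, and the remaining actions are put
    into a fixed order. The stable sort by entity keeps the relative
    order of actions on the same entity, so only interchangeable actions,
    i.e., actions on different entities, are reordered."""
    reduced, seen = [], set()
    for action in plan:
        if action.structural and action in seen:
            continue  # redundant duplicate of a structural action
        seen.add(action)
        reduced.append(action)
    return tuple(sorted(reduced, key=lambda a: a.entity))

class PlanEvaluationCache:
    """Skips the simulation-based evaluation of a plan whenever an
    equivalent plan has already been evaluated."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # expensive simulation-based evaluation
        self.cache = {}
        self.skipped = 0

    def performance(self, plan):
        key = canonical_form(plan)
        if key in self.cache:
            self.skipped += 1  # one simulation run saved
        else:
            self.cache[key] = self.evaluate(plan)
        return self.cache[key]
```

Plans that differ only in the order of interchangeable actions, or by redundant duplicated structural actions, then hit the same cache entry and skip the simulation run.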

Another approach reduces the number of actions in order to reduce the number of evaluations. In this approach, actions are grouped. The grouped actions replace the individual actions and form a smaller search space of actions to be explored. For example, if actions are grouped in pairs, the search space can be reduced from 100 actions to 50 grouped actions, and exploring a search space of 50 actions requires a smaller number of evaluations.

The grouping criteria might be defined by the decision-maker, for example, grouping actions that affect SKUs based on an SKU assortment at a site [17]. Consider a distribution network with two sites, site A and site B, where assortment 1 includes SKU 1 and SKU 2, and assortment 2 includes SKU 3 and SKU 4. Then, action *a*1, which increases the stock level of SKU 1 at site A, and action *a*2, which increases the stock level of SKU 2 at site A, are grouped to form a new action *a*<sup>∗</sup><sub>1</sub> that increases the stock level of the SKUs in assortment 1 at site A. Similarly, action *a*3, which increases the stock level of SKU 3 at site B, and action *a*4, which increases the stock level of SKU 4 at site B, are grouped to form the new action *a*<sup>∗</sup><sub>2</sub> that increases the stock level of the SKUs in assortment 2 at site B. The reduction in the number of actions depends on the number of actions assigned to an assortment. As a result, the optimisation algorithm explores a smaller number of actions.

When constructing an action plan from grouped actions, the number of actions to be selected is lower than with the original actions: whereas four individual actions must be selected to form an action plan of length four, two grouped actions of two actions each already form such a plan. This reduction in the number of selected actions further reduces the number of potential action plans and, hence, the number of evaluations.
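The grouping by assortments might be sketched as follows. The data layout (triples of operation, SKU, and site) and the assortment mapping are hypothetical placeholders mirroring the example above.

```python
from collections import defaultdict

# Hypothetical assortment definitions, mirroring the example above.
ASSORTMENT = {"SKU1": "assortment1", "SKU2": "assortment1",
              "SKU3": "assortment2", "SKU4": "assortment2"}

def group_actions(actions):
    """Replace per-SKU actions by grouped actions that act on a whole
    assortment at a site, shrinking the search space of actions."""
    groups = defaultdict(list)
    for operation, sku, site in actions:
        groups[(operation, ASSORTMENT[sku], site)].append((operation, sku, site))
    return dict(groups)

actions = [("increase_stock", "SKU1", "A"), ("increase_stock", "SKU2", "A"),
           ("increase_stock", "SKU3", "B"), ("increase_stock", "SKU4", "B")]
grouped = group_actions(actions)  # four actions collapse into two grouped ones
```

Each grouped action stands for all individual actions it contains, so the search algorithm only needs to explore the keys of `grouped`.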

#### **8. Evaluation and Results**

In order to test the proposed approaches, we used a database from a distribution network of an international material trading company. The database was extracted from their enterprise resource planning system and was verified and validated according to a procedure model presented by Rabe et al. [82]. In this research, we filtered this database down to a subset of five sites and 30 SKUs; it represents the database in Figure 1. Then, we adapted the *evolutionary algorithm* to utilise the actions' DSI. This information is assigned to the action types and from there inherited by the actions derived from them. The selection probability of actions is changed based on their DSI. For example, considering the actions' type of changes, the actions are classified as structural and parametrical actions, the selection probability of these classes is changed according to Equations (1) and (2), and an action is then selected randomly from the selected class. Thus, the selection probability of actions is biased based on their type of changes.

We compared a completely random selection of actions with a biased selection of actions in the construction of action plans in the EA's initial generation. The number of generations required until stagnation and the quality of the corresponding recommended action plans were recorded for the comparison. The quality of the action plans was evaluated based on the costs and the service level. Fifty individuals form a generation in the EA. The crossover and mutation probabilities were set to 0.8 and 0.3, respectively. The crossover forms, CR, were the one-point crossover, the two-point crossover, and the uniform crossover. In the mutation, MU, either one randomly selected action was replaced or multiple actions were selected to be replaced. Each experimental setup was run ten times. Table 3 records the average performance and a gap analysis of the difference between the random and the biased selection of actions.


**Table 3.** Comparison between random selection of actions and biased selection in an evolutionary algorithm (EA) experiment using the gap analysis (CR = crossover form, MU = mutation form).

Next, we tested the approach that exploits the definition of equivalent action plans. In this experiment, the EA selected actions randomly, and the evaluation of an action plan was skipped if an equivalent action plan had previously been evaluated. Figure 9 shows the time distribution during an experiment run with the crossover and mutation probabilities set to 0.8 and 0.3, respectively. The number of simulation runs decreased by more than 15%.

**Figure 9.** The time distribution in (**a**) a reference EA experiment and (**b**) an EA experiment utilising the equivalent action plans approach.

For the experimentation with the *DQN agent*, the authors assumed that one training episode for the reinforcement learning agent consists of taking three actions. Hence, once the three actions have been taken, the simulation is reset to its initial state and the agent can start a new training episode. The hyper-parameters of the architecture were kept mostly the same for all experiments. However, further architectures for the internal DQN were tested. For each architecture, the reinforcement learning agent was trained with at least 1000 episodes, each consisting of taking three actions, which resulted in at least 3000 evaluative simulation runs per architecture. In most cases, the agent was able to gradually learn to take the best three actions possible for the initial state of the logistics network after about half of the training episodes. By conducting these experiments, the authors could demonstrate that a general reinforcement learning agent is able to generate action plans just from its trained internal value function approximation. Hence, in general, the presented approach can be used to reduce the response time *TD* (cp. Section 1). However, as a drawback, the total runtime of each experiment took several days. As expected, the bottleneck in terms of computing performance was not the back-propagation through the CNN, but the simulation time needed for each evaluative simulation run. Although the experiments were performed on relatively small test models of a logistics network, the results of the experiments with the DQN implementation were promising, since they showed that a general-purpose reinforcement learning agent can in fact be trained to optimize a logistics network model solely from a state representation, a reward signal, and the available action types.

#### **9. Discussion**

The approaches explained in Sections 4 and 5 aim to improve the performance of the LAS. This performance is evaluated based on the performance of the simheuristic's components: metaheuristics and simulation. The quality of the found solutions is a major performance measure for evaluating the metaheuristics used to solve an optimisation problem [7]. Additionally, the run time of the algorithm is used to evaluate it [83]. In this research, we used the objective values of the found solutions and the number of simulation runs to evaluate the recommended approaches. The number of simulation runs represents the number of objective function evaluations, which is an indicator of the run time an algorithm needs to recommend a solution [27].

Researchers proposed approaches to improve the performance of the metaheuristics by focusing on the analysis of the search space. For example, Bode et al. [41] screened solutions before their evaluation. Karimi et al. [84] clustered the problem's search space before exploring it, and Ku and Arthanari [85] replaced the search space with a smaller space. Furthermore, researchers have filtered the solutions to reduce the number of objective function evaluations, such as Cai et al. [86] and Alsheddy et al. [13].

Other approaches have examined the search algorithms. They investigated improvements to the local search, such as Alsheddy et al. [13], who aimed to escape from local optima by introducing penalties. Other researchers have investigated machine learning and data mining to learn from patterns [87,88]. Amaran et al. [89] stated that problem information could be used to construct promising initial solutions as an input to the optimisation algorithm. Forming these solutions by selecting their parts randomly based on problem information is called randomly biased selection [42].

In Section 6, we introduced an approach to guide the search of a metaheuristic algorithm. In this approach, DSI is used to prioritise which actions to select while constructing action plans. The experiments utilising the type of changes to construct action plans showed promising results for identifying initial solutions for the algorithm. We used the Mann-Whitney U test [90] to compare our approach to the random selection of actions. The null hypothesis in comparing the random and the biased selection is rejected concerning decreasing the costs, increasing the service level, and reducing the number of generations until stagnation, at *p*-values of 0.0961, 0.0000, and 0.0000, respectively. These results suggest a significant impact of the DSI on the performance of the metaheuristics of the LAS, and our claims of recommending solutions of better quality and decreasing the number of generations until stagnation are accepted.
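For reference, the U statistic underlying this test can be computed in a few lines of standard-library Python; deriving the p-values from U (via its null distribution or a normal approximation) is omitted here, and the sample values in the usage note are made up for illustration only.

```python
def mann_whitney_u(sample_x, sample_y):
    """U statistic of sample_x over sample_y: the number of pairs in which
    the x-value exceeds the y-value, counting ties as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in sample_x for y in sample_y)
```

A sanity check for tie-free samples is that the two one-sided statistics always add up to the number of pairs, e.g. `mann_whitney_u(x, y) + mann_whitney_u(y, x) == len(x) * len(y)`.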

The domain-specific information can also be used to alter the variation operators applied to the found action plans in the subsequent iterations of a metaheuristic algorithm. In our research, these operators are crossover and mutation. Because initial solutions based on the type of changes helped the algorithm to recommend better solutions than the random selection of actions, we expect a similar effect when implementing the DSI in crossover and mutation. Combining different DSI is a field for further investigation.

In addition, we proposed an approach to reduce the number of simulation runs by defining equivalent solutions. This approach keeps a list of evaluated solutions and slightly increases the EA's computational time, but this is compensated for by the reduction in the number of simulation runs (Figure 9). The additional memory usage could be mitigated by giving the algorithm access to an external table in which the list of evaluated solutions is stored.

In conclusion, simulation runs consume a large portion of the computational time of an algorithm run in a simheuristic approach. Our approach reduced this portion and enabled the algorithm to recommend solutions in a shorter time.

#### **10. Summary and Outlook**

In this research, we have developed a logistics assistance system to support decision-making in material trading networks. Decision-makers face a challenging task in selecting actions to improve the performance of the networks. The developed LAS is based on a simheuristics framework that combines simulation and metaheuristics. For the metaheuristics, we have studied reinforcement learning and evolutionary algorithms. Additionally, we have proposed approaches to address the performance challenges of the LAS, represented by the quality of the recommended actions and the number of simulation runs. Our approaches are based on utilising domain-specific information, reducing the number of actions, and defining equivalent solutions.

We have developed a suitable implementation and used a subset of real-world data to evaluate our LAS. Our results show that reinforcement learning requires a significant training time. However, after learning, it recommends promising actions upon request. The domain-specific information approach guides the search of an optimisation algorithm to select promising actions, and hence, improves the performance of the LAS. Defining equivalent action plans has reduced the number of simulation runs by up to 15%.

Our approaches are limited to distribution networks. We defined and tested them on a distribution network of an international trading company. In other networks or applications, the specific definition of the domain-specific information as well as of the equivalency of action plans should be studied.

In further research, we will investigate other domain-specific information to guide the selection of actions in the LAS. Combining reinforcement learning and the evolutionary algorithm is another approach to be investigated; this combination requires studying a schema to integrate both approaches. Furthermore, new parameters might need to be defined and initialised.

**Author Contributions:** Conceptualization and research supervision, M.R.; General methodology of LAS and implementation of decision support system, F.D.; Development of domain-specific description language, D.S.; Development and evaluation of reinforcement learning approach, F.D.; Development and evaluation of domain-specific biasing methods, M.A.; Design of paper M.R. and M.A.; Original draft preparation, M.A.; Editing, M.R. and M.A.; Supervision, M.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by the German University of Jordan, the Graduate School of Logistics in Dortmund (Germany) and by thyssenkrupp Materials International GmbH.

**Data Availability Statement:** This research has been based on real enterprise data taken from the IT systems of a German company. These data have been handed out under strict NDA and experience specific procedures to ensure their classification, which unfortunately hinders their free release.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

CNN Convolutional Neural Network
DQN Deep Q-Network
DSI Domain-Specific Information
EA Evolutionary Algorithm
LAS Logistics Assistance System
SKU Stock Keeping Unit


#### **References**


*Algorithms* Editorial Office E-mail: algorithms@mdpi.com www.mdpi.com/journal/algorithms
