#### **1. Introduction**

Modern supply chains are complex systems that interconnect the globe. Efficient supply chains control costs and ensure delivery to customers with minimal delays and interruptions. Inventory management is a key component in achieving these goals. Higher inventory levels allow suppliers to maintain better customer service levels, but they come at a higher cost, which often gets passed on to their customers and, ultimately, to the end consumers. This is particularly the case for perishable items, which have a limited shelf life and can go to waste if inventory exceeds demand. Thus, every participant in the supply chain has an incentive to find the appropriate balance in inventory levels to maximize profitability and maintain market competitiveness. Efficient supply chains coordinate material flows among their different stages to avoid the "bullwhip effect", whereby overcorrections lead to a cascading rise or fall in inventory, with a detrimental impact on supply chain costs and performance [1].

Extensive literature exists on supply chain and inventory management. Relevant review papers in the area of inventory optimization include those of Eruguz et al. [2] and Simchi-Levi and Zhao [3]. The inventory management problem (IMP) presented in this work builds upon the problem structure of Glasserman and Tayur [4], who present a single-product, multi-period, serial capacitated supply chain with production and inventory holding locations at each echelon. In their work, Glasserman and Tayur [4] use infinitesimal perturbation analysis (IPA) to determine optimal base-stock levels in an order-up-to policy by optimizing over a sample path of the system.

Other approaches for solving the IMP have been reported in the literature. Chu et al. [5] use agent-based simulation-optimization on a multi-echelon system with an (*r*, *Q*) inventory policy. Expectations are determined via Monte Carlo simulation, and improvements are only accepted after passing a hypothesis test to mitigate the effect of noise. Two-stage stochastic programming (2SSP) is used to optimize small supply chains in the works of Dillon et al. [6], Fattahi et al. [7], and Pauls-Worm et al. [8]. The studied supply chains are either single- or two-echelon chains with centralized or decentralized configurations, a single perishable or non-perishable product, and (*r*, *S*) or (*s*, *S*) policies. Zahiri et al. [9] present a multi-stage stochastic program (MSSP) for a four-level blood supply network with uncertain donation and demand. The model is reformulated and solved using a hybrid multi-objective meta-heuristic. Bertsimas and Thiele [10] apply robust optimization to both uncapacitated and capacitated IMPs; however, production capacity is not explicitly included. Their models are solved with linear programming (LP) or mixed-integer linear programming (MILP), depending on whether fixed costs are included. The reader is referred to Govindan and Cheng [11] for a review of robust optimization and stochastic programming approaches to supply chain planning.

Additionally, there have been a number of efforts to optimize multi-echelon supply chain problems via dynamic programming (DP). A neuro-dynamic programming approach was developed by Roy et al. [12] to solve a two-stage inventory optimization problem under demand uncertainty, reducing costs by 10% over the benchmarked heuristics. Kleywegt et al. [13] formulate a vendor-managed inventory routing problem as a Markov Decision Process (MDP) and develop an approximate dynamic programming (ADP) method to solve it. Topaloglu and Kunnumkal [14] develop a Lagrangian relaxation-based ADP for a single-product, multi-site system that outperforms a linear programming benchmark in managing the network's inventory. Kunnumkal and Topaloglu [15] use ADP to develop stochastic approximation methods that compute optimal base-stock levels for three varieties of inventory management problems: multi-period newsvendor problems with backlogs and with lost sales, and an inventory purchasing problem with uncertain pricing. Cimen and Kirkbride [16] apply ADP to a multi-factory, multi-product inventory management problem with process flexibility. They find that, in most scenarios, the ADP approach finds a policy within 1% of the optimal DP solution in approximately 25% of the computation time. Additional resources on supply chain management with DP and ADP are provided by Sarimveis et al. [17].

Reinforcement learning has also been applied to IMPs in recent years. Mortazavi et al. [18] use Q-learning for a four-echelon IMP with a 12-week cycle and non-stationary demand. Oroojlooyjadid et al. [19] train a Deep Q-Network to play the Beer Game—a classic example of a multi-echelon IMP—and achieve near-optimal results. Kara and Dogan [20] use Q-learning and SARSA to learn stock-based replenishment policies for an IMP with perishable goods. Sultana et al. [21] use a hierarchical RL model to learn re-order policies for a two-level, multi-product IMP with a warehouse and three retailers.

We extend the problem in Glasserman and Tayur [4] to general supply networks with tree topologies. Our focus is not on finding optimal parameters for static inventory policies, but rather on determining and comparing different dynamic policy approaches to the IMP. We build on previous works in the literature by exploring the IMP with different approaches and discussing their relative merits and drawbacks. The approaches studied include:

1. A deterministic linear programming model (DLP) that uses either the rolling horizon or shrinking horizon technique to determine optimal re-order quantities for each time period at each node in the supply network. Customer demand is modeled at its expected value throughout the rolling/shrinking horizon time window (a sketch of the horizon windowing follows this list).

2. A multi-stage stochastic programming model (MSSP) that accounts for the demand uncertainty via a scenario tree (Sections 2.6 and 2.7) and is re-solved with the same rolling/shrinking horizon technique.

3. A reinforcement learning (RL) model (Section 2.9) that learns a dynamic re-order policy by interacting with a simulation of the supply network.

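As a minimal sketch of the shared rolling/shrinking horizon logic (the function name and defaults are ours; the 10-period window and 30-period horizon match the case study in Section 2.10):

```python
def planning_window(t: int, horizon: int = 30, window: int = 10) -> list[int]:
    """Periods covered by the look-ahead model re-solved at period t.

    The window rolls forward each period and shrinks once it would
    extend past the end of the simulation horizon.
    """
    return list(range(t, min(t + window - 1, horizon) + 1))

# planning_window(5)  -> periods 5..14 (rolling horizon)
# planning_window(25) -> periods 25..30 (shrinking horizon)
```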
We build on the work of Hubbs et al. [22] by extending the IMPs presented therein to address multi-echelon problems with multiple suppliers at each echelon, and we contribute new environments to the open-source OR-Gym project (see https://www.github.com/hubbs5/or-gym). The initial version of the IMP in the OR-Gym project was limited to serial multi-echelon systems and did not include multi-stage stochastic programming models for reorder policy optimization. The library was thus generalized to simulate and optimize supply networks with tree topologies under uncertain demand, using the dynamic reorder policies mentioned above.
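For illustration, a random-policy rollout of one of the new environments might look as follows; the environment id `NetworkManagement-v1` is our assumption based on the OR-Gym registry at the time of writing, so check the project README for the current names:

```python
import or_gym

# Network IMP environment (lost-sales variant assumed for "-v1").
env = or_gym.make("NetworkManagement-v1")

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a DLP/MSSP/RL policy
    state, reward, done, info = env.step(action)
```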

#### **2. Materials and Methods**

#### *2.1. Problem Statement*

In this work, we focus on the multi-echelon, multi-period, single-product, and single-market inventory management problem (IMP) in a make-to-order supply network with uncertain stationary demand. The base case supply network has a tree topology with four echelons, as shown in Figure 1. The different sets used for the nodes in the base case network are designated in the figure's legend (raw material, *J<sup>raw</sup>*; main, *J*; retail, *J<sup>retail</sup>*; distributor, *J<sup>dist</sup>*; producer, *J<sup>prod</sup>*; and, market nodes, *J<sup>market</sup>*).

**Figure 1.** Supply Chain Network Schematic.

#### *2.2. Sequence of Events*

The sequence of events in each period of the IMP simulation environment is as follows (a code sketch of one period is given after the list):

1. Main network nodes place replenishment orders to their suppliers. Replenishment orders are filled according to the available production capacity and available inventory at the suppliers.
2. Main network nodes receive incoming inventory replenishment shipments that have made it through the product pipeline (i.e., shipments for which the associated lead times have transpired). The lead times between stages include both production times and transportation times.
3. Market demand occurs at the retail nodes and is filled according to the available inventory at those nodes.
4. One of the following occurs at each retail node:
	- (a) Unfulfilled sales are backlogged at a penalty. Backlogged sales take priority in the following period.
	- (b) Unfulfilled sales are lost and a goodwill loss penalty is levied.
5. Surplus inventory is held at each node at a holding cost.
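A minimal, self-contained Python sketch of the retail-node portion of this event sequence is given below (single node, scalar quantities; all names are ours, not the OR-Gym API):

```python
def simulate_period(onhand, pipeline, demand, backlog, backlogging=True):
    """One period at a retail node, following the event sequence above (sketch)."""
    # Step 2: receive the shipment whose lead time has now transpired.
    onhand += pipeline.pop(0) if pipeline else 0.0
    # Step 3: market demand (plus any backlog) is filled from on-hand inventory.
    order = demand + backlog
    sales = min(order, onhand)
    onhand -= sales
    # Step 4(a)/(b): the shortfall is either backlogged or lost at a penalty.
    unfulfilled = order - sales
    backlog = unfulfilled if backlogging else 0.0
    # Step 5: a holding cost would be charged on the remaining on-hand stock.
    return onhand, backlog, sales, unfulfilled

# Example: 10 units on hand, a shipment of 5 arriving, demand of 12.
print(simulate_period(10.0, [5.0], 12.0, 0.0))  # -> (3.0, 0.0, 12.0, 0.0)
```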

#### *2.3. Key Variables*

Table 1 describes the main variables used in the IMP model. All of the key variables are continuous, and all except the profit variables are non-negative.


**Table 1.** Main variables in the inventory management problem (IMP) model.

| Variable | Description |
| --- | --- |
| *R<sub>t,j</sub>* | Profit at node *j* in period *t* |
| *SR<sub>t,j</sub>*, *PC<sub>t,j</sub>*, *OC<sub>t,j</sub>*, *UP<sub>t,j</sub>*, *HC<sub>t,j</sub>* | Sales revenue, procurement cost, operating cost, unfulfilled demand penalty, and holding cost at node *j* in period *t* |
| *a<sub>t,j,k</sub>* | Reorder quantity shipped on the link from node *j* to node *k* in period *t* |
| *a′<sub>t,k,j</sub>* | Pipeline inventory arriving at node *j* from node *k* in period *t* |
| *S<sup>o</sup><sub>t,j</sub>* | On-hand inventory at node *j* in period *t* |
| *S<sup>p</sup><sub>t,k,j</sub>* | Pipeline inventory on the link from node *k* to node *j* in period *t* |
| *S<sup>d</sup><sub>t,j,k</sub>* | Sales from retail node *j* to market node *k* in period *t* |
| *u<sub>t,j,k</sub>* | Unfulfilled demand at retail node *j* for market node *k* in period *t* |

#### *2.4. Objective Function*

The objective of the IMP optimization is to maximize the time-averaged expected profit of the supply network (*R*, Equation (1)). The uncertain parameter vector associated with the demand in period *t* is given by **ξ**<sub>t</sub>, and a specific realization of the uncertain parameter is denoted with *ξ<sub>t</sub>*. The sequence of uncertain parameters from period *t* through *t′* is represented with **ξ**<sub>[t,t′]</sub>, with *ξ*<sub>[t,t′]</sub> being a specific realization of that sequence (note: *ξ*<sub>[1,1]</sub> refers to stage 1, which is deterministic). The present formulation assumes a single retailer–market link with a single demand in each period. The profit in each period is the sum of the profits in the main network nodes (*R*<sub>1</sub> = ∑<sub>*j*∈*J*</sub> *R*<sub>1,*j*</sub> and *R<sub>t</sub>* = ∑<sub>*j*∈*J*</sub> *R<sub>t,j</sub>*(*ξ<sub>t</sub>*) ∀*t* ∈ {2, …, |*T*|}). The main network nodes include the production/manufacturing, distribution, and retail nodes.

$$\max R = \frac{1}{|T|} \cdot \left( R_1 + \mathbb{E}_{\boldsymbol{\xi}_{[2,|T|]} \mid \xi_{[1,1]}}\!\left[ \max R_2(\boldsymbol{\xi}_2) + \dots + \mathbb{E}_{\boldsymbol{\xi}_{[|T|,|T|]} \mid \xi_{[1,|T|-1]}}\!\left[ \max R_{|T|}(\boldsymbol{\xi}_{|T|}) \right] \right] \right) \tag{1}$$

#### *2.5. IMP Model*

The dynamics of the IMP under demand uncertainty are modeled as a Linear Programming (LP) problem using the linear algebraic constraints that are given below.

#### 2.5.1. Network Profit

The profit in period *t* at node *j* (*R<sub>t,j</sub>*, Equations (2a) and (2b)) is obtained by subtracting the procurement costs (*PC<sub>t,j</sub>*, Equations (4a) and (4b)), operating costs (*OC<sub>t,j</sub>*, Equations (5a) and (5b)), unfulfilled demand penalties (*UP<sub>t,j</sub>*, Equation (6)), and inventory holding costs (*HC<sub>t,j</sub>*, Equations (7a) and (7b)) from the sales revenue (*SR<sub>t,j</sub>*, Equations (3a)–(3c)). The operating costs refer to production costs and, hence, do not apply to distribution nodes, as no manufacturing occurs at these nodes. There is also no unfulfilled demand at non-retail nodes, as all inter-network requests must be feasible.

In the equations shown below, the parameters *p<sub>j,k</sub>*, *b<sub>j,k</sub>*, and *g<sub>k,j</sub>* refer to the material unit price, the unfulfilled unit demand penalty, and the unit material pipeline holding cost (transportation cost) for the link going from node *j* to node *k*, respectively. *o<sub>j</sub>*, *ν<sub>j</sub>*, and *h<sub>j</sub>* refer to the unit operating cost, production yield (0 to 1 range), and on-hand inventory holding cost at node *j*, respectively. The sets *J<sup>in</sup><sub>j</sub>* and *J<sup>out</sup><sub>j</sub>* are the sets of predecessors and successors of node *j*, respectively.

$$R_{1,j} = SR_{1,j} - PC_{1,j} - OC_{1,j} - UP_{1,j} - HC_{1,j} \quad \forall j \in J \tag{2a}$$

$$R_{t,j}(\xi_t) = SR_{t,j}(\xi_t) - PC_{t,j}(\xi_t) - OC_{t,j}(\xi_t) - UP_{t,j}(\xi_t) - HC_{t,j}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J \tag{2b}$$

$$SR_{1,j} = \sum_{k \in J_j^{out}} p_{j,k} \cdot a_{1,j,k} \quad \forall j \in J^{prod} \cup J^{dist} \tag{3a}$$

$$SR_{t,j}(\xi_t) = \sum_{k \in J_j^{out}} p_{j,k} \cdot a_{t,j,k}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{prod} \cup J^{dist} \tag{3b}$$

$$SR_{t,j}(\xi_t) = \sum_{k \in J_j^{out}} p_{j,k} \cdot S_{t,j,k}^{d}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{retail} \tag{3c}$$

$$PC_{1,j} = \sum_{k \in J_j^{in}} p_{k,j} \cdot a_{1,k,j} \quad \forall j \in J \tag{4a}$$

$$PC_{t,j}(\xi_t) = \sum_{k \in J_j^{in}} p_{k,j} \cdot a_{t,k,j}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J \tag{4b}$$

$$OC_{1,j} = \frac{o_j}{\nu_j} \cdot \sum_{k \in J_j^{out}} a_{1,j,k} \quad \forall j \in J^{prod} \tag{5a}$$

$$OC_{t,j}(\xi_t) = \frac{o_j}{\nu_j} \cdot \sum_{k \in J_j^{out}} a_{t,j,k}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{prod} \tag{5b}$$

$$UP_{t,j}(\xi_t) = \sum_{k \in J_j^{out}} b_{j,k} \cdot u_{t,j,k}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{retail} \tag{6}$$

$$HC_{1,j} = h_j \cdot S_{1,j}^{o} + \sum_{k \in J_j^{in}} g_{k,j} \cdot S_{1,k,j}^{p} \quad \forall j \in J \tag{7a}$$

$$HC_{t,j}(\xi_t) = h_j \cdot S_{t,j}^{o}(\xi_t) + \sum_{k \in J_j^{in}} g_{k,j} \cdot S_{t,k,j}^{p}(\xi_t) \quad \forall t \in \{2,\dots,|T|\},\ j \in J \tag{7b}$$

#### 2.5.2. Inventory Balances

The on-hand inventory at each node is updated using material balances that account for incoming and outgoing material, as shown in Equations (8a)–(9c). The inventory level at each node is updated by adding any incoming inventory to, and subtracting any outgoing inventory from, the previously recorded inventory level at that node. The parameter *S<sup>o</sup><sub>0,j</sub>* is the initial inventory at node *j*. The variable *a′<sub>t,k,j</sub>* is the pipeline inventory that arrives at node *j* from node *k* in period *t* (Equation (10)). Outgoing inventory is the inventory that is transferred to downstream nodes, *a<sub>t,j,k</sub>*, or sold to the market at the retail node, *S<sup>d</sup><sub>t,j,k</sub>*. At production nodes, the sales quantities are adjusted for production yields (*ν<sub>j</sub>*). For distribution nodes, *ν<sub>j</sub>* is set to 1.

$$S_{1,j}^{o} = S_{0,j}^{o} + \sum_{k \in J_j^{in}} a'_{1,k,j} - \frac{1}{\nu_j} \cdot \sum_{k \in J_j^{out}} a_{1,j,k} \quad \forall j \in J^{prod} \cup J^{dist} \tag{8a}$$

$$S_{2,j}^{o}(\xi_2) = S_{1,j}^{o} + \sum_{k \in J_j^{in}} a'_{2,k,j} - \frac{1}{\nu_j} \cdot \sum_{k \in J_j^{out}} a_{2,j,k}(\xi_2) \quad \forall j \in J^{prod} \cup J^{dist} \tag{8b}$$

$$S_{t,j}^{o}(\xi_t) = S_{t-1,j}^{o}(\xi_{t-1}) + \sum_{k \in J_j^{in}} a'_{t,k,j} - \frac{1}{\nu_j} \cdot \sum_{k \in J_j^{out}} a_{t,j,k}(\xi_t) \quad \forall t \in \{3,\dots,|T|\},\ j \in J^{prod} \cup J^{dist} \tag{8c}$$

$$S_{1,j}^{o} = S_{0,j}^{o} + \sum_{k \in J_j^{in}} a'_{1,k,j} \quad \forall j \in J^{retail} \tag{9a}$$

$$S_{2,j}^{o}(\xi_2) = S_{1,j}^{o} + \sum_{k \in J_j^{in}} a'_{2,k,j} - \sum_{k \in J_j^{out}} S_{2,j,k}^{d}(\xi_2) \quad \forall j \in J^{retail} \tag{9b}$$

$$S_{t,j}^{o}(\xi_t) = S_{t-1,j}^{o}(\xi_{t-1}) + \sum_{k \in J_j^{in}} a'_{t,k,j} - \sum_{k \in J_j^{out}} S_{t,j,k}^{d}(\xi_t) \quad \forall t \in \{3,\dots,|T|\},\ j \in J^{retail} \tag{9c}$$

$$a'_{t,k,j} = \begin{cases} 0, & \text{if } t - L_{k,j} < 1 \\ a_{1,k,j}, & \text{if } t - L_{k,j} = 1 \\ a_{t-L_{k,j},k,j}(\xi_{t-L_{k,j}}), & \text{if } t - L_{k,j} > 1 \end{cases} \quad \forall t \in T,\ j \in J,\ k \in J_j^{in} \tag{10}$$
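Equation (10) is, in effect, a time-shifted lookup of earlier reorder decisions. A minimal sketch follows, with a hypothetical `orders` mapping from the period an order was placed to the quantity requested on the arc (*k*, *j*):

```python
def delivered_qty(orders: dict[int, float], t: int, lead_time: int) -> float:
    """a'_{t,k,j}: the order placed lead_time periods before t (Equation (10))."""
    placed = t - lead_time
    if placed < 1:
        return 0.0  # nothing was in the pipeline before the horizon began
    return orders.get(placed, 0.0)
```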

Equations (11a)–(11c) provide the pipeline inventory balances at each arc. Once again, inventories are updated by deducting delivered inventory downstream and adding new inventory requests to the previously recorded pipeline inventory levels. It is assumed that, at *t* = 0, there is no inventory in the pipeline.

$$S_{1,k,j}^{p} = -a'_{1,k,j} + a_{1,k,j} \quad \forall j \in J,\ k \in J_j^{in} \tag{11a}$$

$$S_{2,k,j}^{p}(\xi_2) = S_{1,k,j}^{p} - a'_{2,k,j} + a_{2,k,j}(\xi_2) \quad \forall j \in J,\ k \in J_j^{in} \tag{11b}$$

$$S_{t,k,j}^{p}(\xi_t) = S_{t-1,k,j}^{p}(\xi_{t-1}) - a'_{t,k,j} + a_{t,k,j}(\xi_t) \quad \forall t \in \{3,\dots,|T|\},\ j \in J,\ k \in J_j^{in} \tag{11c}$$

#### 2.5.3. Inventory Requests

Upper bounds on the replenishment orders are set depending on the type of node. For the production nodes, downstream replenishment requests are limited by the production capacity, *c<sub>j</sub>*, as given in Equations (12a) and (12b). The requests are also limited by the available feedstock inventory at the production nodes, which is transformed into products with a yield of *ν<sub>j</sub>*, as stated in Equations (13a) and (13b). Because distribution-only nodes do not have manufacturing areas, the upper bounds on any downstream replenishment requests are set by the available inventory at the distribution nodes, which is equivalent to setting *ν<sub>j</sub>* to 1 in Equations (13a) and (13b). These sets of constraints ensure that the reorder quantities are always feasible, i.e., the quantities requested can be sold and shipped in the current period.

$$\sum\_{k \in J\_j^{out}} a\_{1,j,k} \le c\_j \quad \forall j \in J^{prod} \tag{12a}$$

$$\sum\_{k \in J\_j^{out}} a\_{t,j,k}(\xi\_t) \le c\_j \quad \forall t \in \{2, \dots, |T|\}, j \in J^{prod} \tag{12b}$$

$$\sum_{k \in J_j^{out}} a_{1,j,k} \le S_{1,j}^{o} \cdot \nu_j \quad \forall j \in J^{prod} \cup J^{dist} \tag{13a}$$

$$\sum_{k \in J_j^{out}} a_{t,j,k}(\xi_t) \le S_{t,j}^{o}(\xi_t) \cdot \nu_j \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{prod} \cup J^{dist} \tag{13b}$$

#### 2.5.4. Market Sales

The retailer node sells up to its available on-hand inventory in each period when the demand is realized, as given in Equations (14a) and (14b). This includes the start-of-period inventory plus any reorder quantities that arrive at the beginning of the period (before the markets open). Sales at the retailer nodes do not exceed the market demand, *d<sub>t,j,k</sub>*(*ξ<sub>t</sub>*), as shown in Equation (15a). If backlogging is allowed, then any previously backlogged orders are added to the market demand, as shown in Equation (15b). If unfulfilled orders are counted as lost sales, then the *u* term is removed from Equation (15b).

$$\sum_{k \in J_j^{out}} S_{2,j,k}^{d}(\xi_2) \le S_{1,j}^{o} \quad \forall j \in J^{retail} \tag{14a}$$

$$\sum_{k \in J_j^{out}} S_{t,j,k}^{d}(\xi_t) \le S_{t-1,j}^{o}(\xi_{t-1}) \quad \forall t \in \{3,\dots,|T|\},\ j \in J^{retail} \tag{14b}$$

$$S_{2,j,k}^{d}(\xi_2) \le d_{2,j,k}(\xi_2) \quad \forall j \in J^{retail},\ k \in J_j^{out} \tag{15a}$$

$$S_{t,j,k}^{d}(\xi_t) \le d_{t,j,k}(\xi_t) + u_{t-1,j,k}(\xi_{t-1}) \quad \forall t \in \{3,\dots,|T|\},\ j \in J^{retail},\ k \in J_j^{out} \tag{15b}$$

Unfulfilled demand at the retailer is the difference between the market demand and the actual retail sales in the current period (Equations (16a) and (16b)). If the network operates in lost-sales mode, then the *u* term on the right-hand side of Equation (16b) is removed.

$$u_{2,j,k}(\xi_2) = d_{2,j,k}(\xi_2) - S_{2,j,k}^{d}(\xi_2) \quad \forall j \in J^{retail},\ k \in J_j^{market} \tag{16a}$$

$$u_{t,j,k}(\xi_t) = d_{t,j,k}(\xi_t) + u_{t-1,j,k}(\xi_{t-1}) - S_{t,j,k}^{d}(\xi_t) \quad \forall t \in \{3,\dots,|T|\},\ j \in J^{retail},\ k \in J_j^{market} \tag{16b}$$

#### 2.5.5. Variable Domains

$$R_{1,j} \in \mathbb{R} \quad \forall j \in J \tag{17a}$$

$$R_{t,j}(\xi_t) \in \mathbb{R} \quad \forall t \in \{2,\dots,|T|\},\ j \in J \tag{17b}$$

$$S_{1,j}^{o} \ge 0 \quad \forall j \in J \tag{18a}$$

$$S_{t,j}^{o}(\xi_t) \ge 0 \quad \forall t \in \{2,\dots,|T|\},\ j \in J \tag{18b}$$

$$a_{1,k,j},\ S_{1,k,j}^{p} \ge 0 \quad \forall j \in J,\ k \in J_j^{in} \tag{19a}$$

$$a_{t,k,j}(\xi_t),\ S_{t,k,j}^{p}(\xi_t) \ge 0 \quad \forall t \in \{2,\dots,|T|\},\ j \in J,\ k \in J_j^{in} \tag{19b}$$

$$S_{t,j,k}^{d}(\xi_t),\ u_{t,j,k}(\xi_t) \ge 0 \quad \forall t \in \{2,\dots,|T|\},\ j \in J^{retail},\ k \in J_j^{out} \tag{20}$$

#### *2.6. Scenario Tree for Multistage Stochastic Programming*

Equations (1)–(20) describe a multistage stochastic inventory management problem. In principle, continuous or discrete probability distributions can be used to model the uncertain demands in the stochastic process (*ξ*<sub>1</sub>, *ξ*<sub>2</sub>, …, *ξ<sub>t</sub>*, …, *ξ*<sub>|*T*|</sub>). However, in most applications, a scenario-based approach is assumed for ease of computation, i.e., there is a finite number of realizations of the uncertain parameter. For illustration purposes, Figure 2 shows a scenario tree corresponding to a three-stage stochastic programming problem. At stage one, the decision-maker does not know the realizations of the uncertain parameters in the future time periods. At stage two, there are two different realizations of the uncertain parameters, and the decision-maker can take different actions depending on which realization occurs. For each realization at stage two, there are two different realizations at stage three. Therefore, the scenario tree presented in Figure 2 has four scenarios in total.

It is easy to observe that the number of scenarios grows exponentially with the number of stages. For example, in the IMP, if we consider three realizations of the demand per stage, the total number of scenarios is 3<sup>29</sup> ≈ 7 × 10<sup>13</sup> for a 30-period problem. Moreover, a Poisson distribution is assumed for the demand, so in principle there is an infinite number of realizations per stage. In order to reduce the computational complexity, we introduce an approximation of the multistage stochastic programming problem in the next subsection.

**Figure 2.** The scenario tree for a three-stage stochastic program with two realizations per stage.

#### *2.7. Approximation for the Multistage Scenario Tree*

In order to make the multistage stochastic IMP tractable, we make the following two simplifications.

First, the Poisson(*λ*) distribution (shown in Equation (21)) is approximated by a discrete distribution with three realizations.

$$p(\mathbf{x} = k) = \frac{\lambda^k e^{-\lambda}}{k!} \tag{21}$$

Note that the mean and variance of Poisson(*λ*) are both *λ*. The values of the three realizations are chosen to be *λ* − ⌈√*λ*⌉, *λ*, and *λ* + ⌈√*λ*⌉. The probabilities of the three realizations are chosen such that the Wasserstein-1 distance to the original Poisson(*λ*) distribution is minimized. In other words, for all *k* = 1, 2, …, ∞, the probability of *x* = *k* under Poisson(*λ*) is assigned to the realization of the new distribution that is closest to *k*. The probabilities for this new distribution are given in Equations (22a)–(22c).

$$p(\mathbf{x} = \lambda - \lceil \sqrt{\lambda} \rceil) = \sum\_{k=1}^{\lambda - \lceil \frac{\sqrt{\lambda}}{2} \rceil} \frac{\lambda^k e^{-\lambda}}{k!} \tag{22a}$$

$$p(\mathbf{x} = \lambda) = \sum\_{k=\lambda-\lceil \frac{\sqrt{\lambda}}{2} \rceil + 1}^{\lambda + \lceil \frac{\sqrt{\lambda}}{2} \rceil - 1} \frac{\lambda^k e^{-\lambda}}{k!} \tag{22b}$$

$$p(\mathbf{x} = \lambda + \lceil \sqrt{\lambda} \rceil) = \sum\_{k=\lambda + \lceil \frac{\sqrt{\lambda}}{2} \rceil}^{+\infty} \frac{\lambda^k e^{-\lambda}}{k!} \tag{22c}$$

This scenario generation approach, in which the values of the realizations are fixed and the probability of each realization is chosen to minimize the Wasserstein-1 distance between the probability distribution of the scenario tree and the "true" distribution, has been reported in [23].
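As a worked sketch of Equations (22a)–(22c) using SciPy's Poisson distribution (the function name is ours, and the summation limits follow the equations above):

```python
import math
from scipy.stats import poisson

def three_point_poisson(lam: int) -> dict[int, float]:
    """Three-realization approximation of Poisson(lam) per Equations (22a)-(22c)."""
    r = math.ceil(math.sqrt(lam))      # offset of the outer realizations
    h = math.ceil(math.sqrt(lam) / 2)  # cut point between neighboring realizations
    p_low = poisson.cdf(lam - h, lam) - poisson.pmf(0, lam)          # k = 1..lam-h
    p_mid = poisson.cdf(lam + h - 1, lam) - poisson.cdf(lam - h, lam)
    p_high = 1.0 - poisson.cdf(lam + h - 1, lam)                     # k >= lam+h
    return {lam - r: p_low, lam: p_mid, lam + r: p_high}

# Case-study demand, lam = 20: realizations 15, 20, and 25.
print(three_point_poisson(20))
```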

Second, even with three realizations per stage, the number of scenarios for *T* = 30 becomes 3<sup>29</sup>. Because the decisions in the later periods have a smaller impact on the here-and-now decisions, we only consider three realizations per stage up to stage 6. After stage 6, the demands are assumed to be deterministic and take the mean value *λ*.

With these two simplifications, Figure 3 shows the scenario tree for the approximate multistage stochastic IMP. The size of the scenario tree grows exponentially until stage 6 and only linearly thereafter. In total, 3<sup>5</sup> = 243 scenarios are considered. With a slight abuse of notation, we still denote the problem shown in Figure 3 as MSSP.
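For concreteness, the 243 demand paths of the approximate tree can be enumerated directly (shown for the case-study demand with *λ* = 20; variable names are ours):

```python
import itertools

lam = 20
realizations = [15, 20, 25]  # lam - ceil(sqrt(lam)), lam, lam + ceil(sqrt(lam))
branching_stages = 5         # stages 2 through 6 branch three ways
mean_stages = 24             # stages 7 through 30 take the mean demand

scenarios = [
    list(branch) + [lam] * mean_stages
    for branch in itertools.product(realizations, repeat=branching_stages)
]
print(len(scenarios))  # 3**5 = 243 demand paths over periods 2-30
```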

**Figure 3.** An approximation of the multistage stochastic program.

#### *2.8. Perfect Information and Deterministic Model*

We benchmark the MSSP model against a perfect information model and a deterministic model. In the perfect information model, it is assumed that the demand realizations from *t* = 1 to *t* = |*T*| are known beforehand. Equation (1) then reduces to Equation (23), where we assume that the realization of (*ξ*<sub>2</sub>, …, *ξ<sub>t</sub>*, …, *ξ*<sub>|*T*|</sub>) is known and optimize over this given realization.

$$\max R(\xi) = \max \frac{1}{|T|} \cdot \left( R_1 + \sum_{t \in \{2,\dots,|T|\}} R_t(\xi_t) \right) \tag{23}$$

In the deterministic model (DLP), we optimize over the expected value of *ξ<sub>t</sub>*, denoted as *ξ̄<sub>t</sub>*. This means that the demand is assumed to be *λ* (the mean of the Poisson distribution) for all periods.

#### *2.9. Reinforcement Learning Model*

Reinforcement learning (RL) is a machine learning method whereby an *agent* learns to maximize a *reward* via interactions with an *environment*. The feedback the agent receives from the reward allows it to learn a *policy*, which is a function that directs the agent at each step through the environment. Because of the data-intensive and interactive nature of RL, agents are typically trained by interacting with Monte Carlo simulations, making decisions at each time step.

To this end, we formulate the IMP as a Markov Decision Process (MDP), a stochastic, sequential decision-making problem. At each time period *t*, the agent observes the current state of the system (*S<sub>t</sub>*) and then selects an action (*a<sub>t</sub>*) that is passed to the system. The simulation then advances to the next state (*S*<sub>*t*+1</sub>) based on the realization of the random variables and the selected action, and it returns the associated reward (*R<sub>t</sub>*), which is simply the profit function given in Equation (2a) summed over all nodes *j* (*R<sub>t</sub>* = ∑<sub>*j*∈*J*</sub> *R<sub>t,j</sub>*).

The state consists of a vector with entries for the current demand at the retail node, the inventory levels at each node in the network, and the inventory in the pipeline along every edge in the network. In terms of the model notation from the previous subsection, *S<sub>t</sub>* = {*d<sub>t,j,k</sub>*, *S<sup>o</sup><sub>t,j</sub>*, *S<sup>p</sup><sub>t,k,j</sub>* | *j* ∈ *J*, *k* ∈ *J<sup>out</sup><sub>j</sub>*, *k* ∈ *J<sup>in</sup><sub>j</sub>*}. The action in each period is a vector with all of the reorder quantities in the network (*a<sub>t</sub>* = {*a<sub>t,k,j</sub>* | *j* ∈ *J*, *k* ∈ *J<sup>in</sup><sub>j</sub>*}).

In this work, we rely on Proximal Policy Optimization (PPO), as described in Schulman et al. [24]. This has become a popular algorithm in the RL community because it frequently exhibits stable learning characteristics. PPO is an actor-critic method that uses two neural networks that interact with one another. The actor, parameterized by *θ<sub>π</sub>*, learns the policy, while the critic, parameterized by *θ<sub>v</sub>*, learns the value function. The policy the actor learns is probabilistic in nature and yields a probability distribution over actions during each forward pass. For our IMP, the action space consists of re-order values for each node in the network. These are discrete values that range from 0 to the maximum order quantity at each individual node. If the actor chooses an action that is greater than the quantity that can be supplied (for example, if the maximum re-order quantity is 100, but only 90 units are available), then the minimum of these two values is supplied.

The critic learns the value function, which allows it to estimate the sum of the discounted future rewards available from each state. The difference between the actual rewards and the estimates supplied by the critic is known as the *temporal difference error* (TD-error, or *δ<sub>T</sub>*). These differences are discounted and summed to provide the advantage estimate of the state, *Â<sub>k</sub>*, as given in Equation (24),

$$\hat{A}\_k = \sum\_{T=1}^t \gamma^{t-T+1} \delta\_T \tag{24}$$

where *γ* is the discount factor used to prioritize current rewards over future rewards. This value is then used in the loss function, L(*θ*), and the parameters of the networks are updated using stochastic gradient descent to minimize that loss. PPO has been shown to be effective in numerous domains, exhibiting stable learning features, as discussed in Schulman et al. [24]. PPO achieves this by penalizing large policy updates, optimizing the conservative loss function given by Equation (25), where *r<sub>k</sub>*(*θ*) is the probability ratio between the new policy, *π<sub>k</sub>*(*θ*), and the previous policy, *π*<sub>*k*−1</sub>(*θ*). Here, we use *k* to denote each policy iteration since the parameters were initialized. The clip function removes the incentive for moving *r<sub>k</sub>*(*θ*) outside the interval [1 − *ε*, 1 + *ε*]. The hyperparameter *ε* limits the update of the policy such that the probability ratio does not change by more than ±*ε* at each update. For more detail, see the work by Schulman et al. [24].

$$\mathcal{L}(\theta) = \min \left( r\_k(\theta) \hat{A}\_k, \operatorname{clip} \left( r\_k(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}\_k \right) \tag{25}$$
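A minimal PyTorch sketch of the clipped objective in Equation (25) follows (our own helper, not Ray's internal implementation; the sign is flipped so that minimizing the returned value maximizes the clipped surrogate):

```python
import torch

def ppo_clip_loss(ratio: torch.Tensor, advantage: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective of Equation (25), averaged over a batch."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # negate to maximize via SGD
```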

In the present work, we rely on the implementation of the PPO algorithm found in the Ray package [25]. A two-layer, 256-node feed-forward network is trained over 70,000 episodes (simulated 30-day periods) in which the agent learns to maximize the expected reward by interacting with the environment. Given that this is a model-free approach, the agent must learn through trial and error. The initial policy consists of randomly initialized weights and biases, so the output actions are on par with random decisions. After each episode, the results are collected and the weights and biases are updated to minimize the loss function according to the PPO algorithm discussed above. As one would expect, this initial policy performs poorly, with the agent losing roughly \$300 per episode. However, as shown in the training curve presented in Figure 4, the agent is able to improve the policy with additional experience and learn a very effective policy for controlling inventory across the network.
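A hedged sketch of the training setup under the RLlib API current at the time (circa Ray 1.x) is shown below; the environment id and the `episodes_total` stopping criterion are assumptions matching the description above, and the exact config keys vary across Ray versions:

```python
import ray
from ray import tune
from ray.tune.registry import register_env
import or_gym

# Wrap the OR-Gym environment so that RLlib can construct it by name.
register_env("net_im", lambda env_config: or_gym.make("NetworkManagement-v1"))

ray.init()
tune.run(
    "PPO",
    config={
        "env": "net_im",
        "model": {"fcnet_hiddens": [256, 256]},  # two-layer, 256-node network
    },
    stop={"episodes_total": 70_000},
)
```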

**Figure 4.** Training curve for the reinforcement learning (RL) model.

#### *2.10. Case Study*

A 30-period system is used for the case study in order to represent one month of inventory management in a supply network. A time step of one day is used, such that demand is received on a daily basis. Figure 5 depicts the network structure of the base case system. Major parameters and their values are included in the figure. Parameters next to nodes are specific to the node (initial state or on-hand inventory, *S*<sub>0</sub>; unit holding cost, *h*; unit operating cost, *o*; production yield, *ν*; and production capacity, *c*), whereas those next to links are specific to that link (unit sales price, *p*; unfulfilled unit demand penalty, *b*; market demand distribution, *d*; unit pipeline holding cost, *g*; and lead time, *L*). Node and link subscripts are dropped in the schematic for clarity. In order to compare the performance of the three modeling approaches (DLP, MSSP, and RL), 100 unique sample paths were generated. Each sample path consists of 30 demand realizations, one for each period in the simulation horizon, sampled from a Poisson distribution with a mean of 20.

The execution of the 100 simulations follows the sequence of events described in Section 2.2 for each time period. At the beginning of each period, prior to Step 1 in the event sequence, each optimization model is called to obtain the reorder quantities for each node in that period. The models have no knowledge of the demand realizations that will occur in the current and future periods, but they can rely on the variables/states from previous periods. The reorder quantities obtained by the models for the current period are then passed as the actions in Step 1 of the sequence of events. The remaining events for that period then unfold, with the demand realization for that period taken from the sample path assigned to that simulation instance. The process is repeated, re-solving the models at the beginning of each time period, until the 30 periods are complete. A minimal sketch of this evaluation loop is shown below.
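In the sketch, `sample_paths`, `make_env`, and `reorder_policy` are hypothetical placeholders for the pre-sampled demand sequences, the IMP simulation, and whichever of the DLP, MSSP, or RL models is being evaluated:

```python
results = []
for path in sample_paths:             # 100 pre-sampled 30-period demand sequences
    env = make_env(demand_path=path)  # hypothetical wrapper around the IMP simulation
    state = env.reset()
    total_profit = 0.0
    for t in range(1, 31):
        action = reorder_policy(state, t)  # re-solve DLP/MSSP, or query RL policy
        state, reward, done, info = env.step(action)
        total_profit += reward
    results.append(total_profit)      # used to compare DLP, MSSP, RL, and the Oracle
```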

The three modeling approaches are benchmarked against a perfect information model (also referred to as the *Oracle*). A 10-period window is used for the rolling horizon (RH) mode in the DLP and MSSP models. Towards the end of the simulation, however, the RH becomes a shrinking horizon (SH), since the window cannot roll past period 30.

**Figure 5.** Supply Chain Network Schematic with Network Parameters used in the Case Study.
