**4. Discussion**

The results in Table 2 indicate that the rolling horizon DLP model outperforms the RL model when backlogging is allowed, but is outperformed by it when unfulfilled demand becomes lost sales. When backlogging is allowed, unfulfilled demand can be satisfied in a later period at a penalty, which reduces the need for high service levels. Service levels become more important in the lost sales case, where not only is a goodwill penalty assessed, but the potential profit from the sale is also lost. Because the RL model does a better job of maintaining on-hand inventories, it displays the higher service levels shown in Figure 9 and superior performance in the lost sales case. It should be noted that the differences between the two approaches are rather small (7% and 3%, respectively), and that both are within 15% of the perfect information model.
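The difference between the two demand treatments can be made concrete with a minimal sketch of single-period retailer accounting. The function name `step_retailer`, its arguments, and the numbers in the usage example are hypothetical and chosen only for illustration; they are not taken from the models compared here, and the assumption that the goodwill penalty is charged on all outstanding unfulfilled demand each period is ours.

```python
def step_retailer(on_hand, demand, backlog, price, penalty, allow_backlog):
    """One period of retailer accounting (illustrative sketch only).

    Returns (remaining inventory, backlog carried forward, period profit).
    """
    # With backlogging, demand left over from earlier periods is still owed.
    owed = demand + (backlog if allow_backlog else 0)
    sales = min(on_hand, owed)
    unfulfilled = owed - sales
    # Assumption: the goodwill penalty applies to every unfulfilled unit.
    profit = price * sales - penalty * unfulfilled
    # Backlog: unmet demand carries over. Lost sales: it disappears.
    new_backlog = unfulfilled if allow_backlog else 0
    return on_hand - sales, new_backlog, profit


def two_period_profit(allow_backlog):
    """Hypothetical two-period run: a stockout followed by a replenishment."""
    _, backlog, p1 = step_retailer(5, 8, 0, 2.0, 1.0, allow_backlog)
    _, _, p2 = step_retailer(10, 4, backlog, 2.0, 1.0, allow_backlog)
    return p1 + p2
```

In this toy run the same stockout costs more under lost sales, because the revenue from the three unmet units is never recovered, which is consistent with service levels mattering more in that case.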

As expected, the shrinking horizon DLP exhibits superior performance relative to its rolling horizon counterpart, because it looks further ahead in time during the optimization. In the rolling horizon approach, the short-sighted model tends to drop inventory levels at the top tier suppliers (nodes 4–6) sooner in an attempt to reduce the inventory holding costs towards the end of the optimization window. However, since the simulation horizon extends beyond the 10-period optimization window, that inventory ends up accumulating in the medium tier suppliers (nodes 2–3), driving up holding costs overall. From a service level standpoint, the shrinking horizon DLP maintains higher inventory levels at the retailer than its rolling horizon counterpart, allowing it to achieve higher service levels (see Figure 9). However, it is interesting to note that the opposite is observed in the MSSP model, which, despite having a higher profit, has lower service levels (higher unfulfilled demand) in the shrinking horizon case. The greater profit is a result of the shrinking horizon reducing holding costs by 13% overall, which has a greater impact on profit than the unfulfilled demand penalties. At the retailer node alone, the holding cost to demand penalty ratio is 3:1, which incentivizes the model to sacrifice some demand satisfaction to reduce the holding costs. Overall, the MSSP model yields superior performance in all cases, coming in within 8% and 3% of the best possible outcome (*Oracle*), on average, for the rolling horizon and shrinking horizon modes, respectively.
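The myopia of the rolling horizon mode can be illustrated by comparing how many periods each mode sees when the model is re-solved at period t. The sketch below assumes a 30-period simulation and the 10-period rolling window mentioned above; the function `lookahead` and the truncation of the rolling window at the end of the simulation are illustrative assumptions, not details confirmed by the implementation.

```python
def lookahead(t, sim_length, window=None):
    """Number of periods visible to the optimizer when re-solving at period t.

    window=None -> shrinking horizon: look ahead to the end of the simulation.
    window=k    -> rolling horizon: look ahead at most k periods
                   (assumed to be truncated at the end of the simulation).
    """
    remaining = sim_length - t
    return remaining if window is None else min(window, remaining)


SIM_LENGTH = 30  # assumed simulation length
rolling = [lookahead(t, SIM_LENGTH, window=10) for t in range(SIM_LENGTH)]
shrinking = [lookahead(t, SIM_LENGTH) for t in range(SIM_LENGTH)]
```

Under these assumptions the two modes only coincide over the final 10 periods; before that, the rolling horizon model optimizes as if the simulation ended at the edge of its window, which is consistent with its earlier inventory drawdown at the top tier suppliers.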

From an operational standpoint, the Oracle and shrinking horizon models prioritize inventory flow to the retailer via nodes 5 and 2, which have a lower holding cost than the alternatives, as shown by the timing of inventory transfers in Figure 6 and the flow patterns in Figure 8. The transportation cost for this path is 0.015, with a lead time of 14 days, whereas the other paths have transportation costs in the range 0.017–0.021, with lead times in the 13–16 day range. Once inventory at node 5 is depleted, the other top tier suppliers (nodes 4 and 6) begin to send inventory downstream. The rolling horizon models, on the other hand, send inventory from all of the top tier suppliers from the start, due to the myopic effects of the reduced optimization window. In general terms, the inventory profiles shown in Figure 6 for the rolling horizon models are similar to those of their shrinking horizon counterparts, except that the inventory changes are shifted to earlier times. All of the mathematical programming models also take advantage of the fact that the pipeline inventory costs are lower than the holding costs at the supplier nodes. They therefore send more inventory from nodes 4 and 6 to node 3 than is needed, so as to reduce costs. Most of this additional inventory ends up accumulating at node 3, as it is cheaper to source the retailer from node 2 than from node 3. Although the DLP and MSSP models exhibit similar inventory profiles, the superiority of the MSSP model arises from the fact that, unlike the DLP model, it accounts for uncertainty in the demand, which enables it to target superior service levels and reduce holding costs.

In contrast to the mathematical programming models, the RL model avoids drastic changes in the inventory positions, maintaining levels throughout the simulation. This is supported not only by the inventory levels in Figure 6, but also by the flow pattern shown in Figure 7, which indicates that, contrary to the mathematical programming models, the RL model distributes requests more evenly amongst the suppliers of each node. This conservative approach explains why the profits obtained with the RL model are lower than those obtained by most of the mathematical programming models. In practice, the policy from the RL model is preferred, as it reduces shocks to the inventory levels. Furthermore, the RL policy manages the supply network with potentially greater resiliency to disruptions as a result of the balanced load distribution within the network. Unlike the other models, which have virtually no flow from the raw material nodes to the top tier suppliers and rely solely on the initial inventory at these nodes, the RL model gradually replenishes inventories at the top tier nodes in order to avoid their depletion. This conservative behavior of the RL model is observed as a result of the PPO algorithm used, which penalizes large policy changes.
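For reference, the way in which PPO discourages large policy changes can be written out explicitly. The text above does not state which PPO variant is used; assuming the widely used clipped surrogate form, the objective maximized at each policy update is

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is the estimated advantage of action $a_t$ in state $s_t$ and $\epsilon$ is the clipping parameter. The symbols follow the standard PPO notation rather than any notation defined in this work. Because the probability ratio $r_t(\theta)$ yields no additional gain once it moves outside $[1-\epsilon,\,1+\epsilon]$, each update has little incentive to move the new policy far from the previous one.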

A drawback of the current implementation of the supply network is that all of the models exhibit end-of-simulation effects, in which the inventory drops to zero or near zero at the end of the simulation to avoid excess holding costs. In a real application, this could be avoided by penalizing the models for depleting inventories near the end of the simulation, by adding terminal inventory constraints (Lima et al. [26]), or by running the models for longer simulation horizons, since most applications extend beyond 30 periods. The latter option would not be viable for the stochastic programming models, as it would affect their tractability. Despite these limitations, the three approaches show promise in obtaining dynamic reorder policies that improve the supply network performance to within 3% to 15% of perfect information dynamic policies, which do not exist in practice.
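The first of these remedies, a penalty on depleted end-of-simulation inventories, could take a form such as the following minimal sketch. The function `terminal_penalty`, the per-node target levels, and the unit penalty are hypothetical and were not part of the models evaluated here; the resulting term would be subtracted from the profit objective.

```python
def terminal_penalty(final_inventory, target, unit_penalty):
    """Penalty for ending the simulation below a target inventory per node.

    final_inventory and target map node id -> inventory level.
    Only shortfalls are penalized; inventory above target incurs no charge here.
    """
    shortfall = sum(
        max(0.0, target[node] - final_inventory.get(node, 0.0))
        for node in target
    )
    return unit_penalty * shortfall


# Hypothetical example: node 1 ends empty, node 2 ends above its target.
example_penalty = terminal_penalty({1: 0.0, 2: 50.0}, {1: 100.0, 2: 40.0}, 0.5)
```

A hard terminal inventory constraint would instead require the final inventory at each node to be at least its target, at the cost of possible infeasibility when the target cannot be reached.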
