*Article* **Integrating Production Planning with Truck-Dispatching Decisions through Reinforcement Learning While Managing Uncertainty**

**Joao Pedro de Carvalho \* and Roussos Dimitrakopoulos \***

COSMO—Stochastic Mine Planning Laboratory, Department of Mining and Materials Engineering, McGill University, 3450 University Street, Montreal, QC H3A 0E8, Canada

**\*** Correspondence: joao.decarvalho@mail.mcgill.ca (J.P.d.C.); roussos.dimitrakopoulos@mcgill.ca (R.D.)

**Abstract:** This paper presents a new truck dispatching policy approach that is adaptive given different mining complex configurations in order to deliver supply material extracted by the shovels to the processors. The method aims to improve adherence to the operational plan and fleet utilization in a mining complex context. Several sources of operational uncertainty arising from the loading, hauling and dumping activities can influence the dispatching strategy. Given a fixed sequence of extraction of the mining blocks provided by the short-term plan, a discrete event simulator model emulates the interaction arising from these mining operations. The continuous repetition of this simulator and a reward function, associating a score value to each dispatching decision, generate sample experiences to train a deep Q-learning reinforcement learning model. The model learns from past dispatching experience, such that when a new task is required, a well-informed decision can be quickly taken. The approach is tested at a copper–gold mining complex, characterized by uncertainties in equipment performance and geological attributes, and the results show improvements in terms of production targets, metal production, and fleet management.

**Keywords:** truck dispatching; mining equipment uncertainties; orebody uncertainty; discrete event simulation; Q-learning

#### **1. Introduction**

In short-term mine production planning, the truck dispatching activities aim to deliver the supply material, in terms of quantity and quality, being extracted from the mining fronts by the shovels to a destination (e.g., processing facility, stockpile, waste dump). The dispatching decisions considerably impact the efficiency of the operation and are of extreme importance, as a large portion of the mining costs are associated with truck-shovel activities [1–4]. Truck dispatching is included under fleet optimization, which also comprises equipment allocation, positioning shovels at mining faces and defining the number of trucks required [2,5,6]. Typically, the truck dispatching and allocation tasks are formulated as a mathematical programming approach whose objective function aims to minimize equipment waiting times and maximize production [7–11]. Some methods also use heuristic rules to simplify the decision-making strategy [12–14]. In general, a limiting aspect of the structure of these conventional optimization methods is related to the need to reoptimize the model if the configuration of the mining complex is modified, for example, if a piece of fleet equipment breaks. Alternatively, reinforcement learning (RL) methods [15] provide means to make informed decisions under a variety of situations without retraining, as these methods learn from interacting with an environment and adapt to maximize a specific reward function. The ability to offer fast solutions given multiple conditions of the mining complex is a step towards generating real-time truck dispatching responses. Additionally, most methods dealing with fleet optimization are applied to single mines, whereas an industrial mining complex is a set of integrated operations and facilities transforming geological resource supply into sellable products. A mining complex can include multiple mines, stockpiles, tailing dams, processing routes, transportation systems, equipment types and sources of uncertainty [16–27].

**Citation:** de Carvalho, J.P.; Dimitrakopoulos, R. Integrating Production Planning with Truck-Dispatching Decisions through Reinforcement Learning While Managing Uncertainty. *Minerals* **2021**, *11*, 587. https://doi.org/10.3390/min11060587

Academic Editors: Rajive Ganguli, Sean Dessureault and Pratt Rogers

Received: 16 April 2021; Accepted: 26 May 2021; Published: 31 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The truck-dispatching model described herein can be viewed as a particular application belonging to the field of material delivery and logistics in supply chains, commonly modelled as vehicle routing problems and their variants [28–30]. Dynamic vehicle routing problems [31,32] are an interesting field, which allows for the inclusion of stochastic demands [33] and situations where the customers' requests are revealed dynamically [34]. These elements can also be observed in truck-dispatching activities in mining complexes, given that different processors have uncertain performances and that production targets may change given the characteristics of the feeding materials. One particularity of the truck-dispatching model herein is that the trips performed between shovels and destinations usually have short lengths and are repeated multiple times. Another important aspect is that uncertainty arises from the geological properties of the transported materials and the performance of different equipment. Over the last two decades, there has been an effort to develop frameworks accommodating uncertainties in relevant parameters within mining complex operations to support more informed fleet management decisions. Not accounting for the complexity and uncertainties inherent to operational aspects misrepresents queue times, cycle times and other elements, which inevitably translates into deviations from production targets [6,35]. Ta et al. [9] allocate the shovels by a goal programming approach, including uncertainties in truckload and cycle times. A few other approaches optimize fleet management and production scheduling in mining complexes under both geological and equipment uncertainty [22,36,37].

A common strategy to model the stochastic interactions between equipment and processors in an operating mining environment is through the use of discrete event simulation (DES) approaches [35,38–43]. The DES allows for replacing an extensive mathematical description or rule concerning stochastic events by introducing randomness and probabilistic parameters related to a sequence of activities. The environment is characterized numerically by a set of observable variables of interest, such that each event modifies the state of the environment [44]. This simulation strategy can be combined with ideas from optimization approaches. Jaoua et al. [45] describe a detailed truck-dispatching control simulation, emulating real-time decisions, coupled with a simulated annealing-based optimization that minimizes the difference between tonnage delivered and associated targets. Torkamani and Askari-Nasab [35] propose a mixed integer programming model to allocate shovels to mining faces and establish the number of required truck trips. The solution's performance is assessed by a DES model that includes stochastic parameters such as truck speed, loading and dumping times, and equipment failure behavior. Chaowasakoo et al. [46] study the impact of the match factor to determine the overall efficiency of truck-shovel operations, combining a DES and selected heuristics maximizing production. Afrapoli et al. [47] propose a mixed integer goal programming model to reduce shovel and truck idle times and deviations from production targets. A simulator of the mining operations triggers the model to be reoptimized every time a truck requires a new allocation. Afrapoli et al. [11] combine a DES with a stochastic integer programming framework to minimize equipment waiting times.

It is challenging to capture the dynamic and uncertain nature of the truck-shovel operation in a mathematical formulation. The daily operations in a mining complex are highly uncertain; for example, equipment failure, lack of staff or weather conditions can cause deviations from production targets and modifications in the dispatching policy. These events change the performance of the mining complex; thus, the related mathematical programming model needs to be reoptimized. The use of a DES of the mining complex facilitates the modelling of such events. Note that some of the above-mentioned approaches simulate the mining operations to assess the dispatching performance or improve it using heuristic approaches. This strategy can generate good solutions, but the models do not learn from previous configurations of the mining complex.

Unlike the mentioned heuristic approaches, RL-based methods can take advantage of a mining complex simulator to define agents (decision-makers) that interact with this environment based on actions and rewards. The repetition of such interaction provides these agents with high learning abilities, which enables fast responses when a new assignment is required. Recent approaches have achieved high-level performances over multiple environments that require complex and dynamic tasks [48–55]. They have also been applied to some short-term mine planning aspects, showing interesting results [56–58].



This paper presents a truck-dispatching policy based on deep Q-learning, one of the most popular RL approaches, in order to improve daily production and overall fleet performance, based on the work in Hasselt et al. [50]. A DES is used to model daily operational aspects, such as loading, hauling and dumping activities, generating samples to improve the proposed truck-dispatching policy. A case study applies the method to a copper–gold mining complex, which considers equipment uncertainty, modelled from historical data, and orebody simulations [59–63] that assess the uncertainty and variability related to metal content within the resource model. Conclusions and future work follow.

#### **2. Method**

The proposed method adapts the deep double Q-learning neural network (DDQN) method [50] for dispatching trucks in a mining environment. The RL agents continually take actions over the environment and receive rewards associated with their performances [15]. Herein, each mining truck is considered an agent; therefore, these terms are used interchangeably throughout this paper. The DES, described in Section 2.1, receives the decisions made by the agents, simulates the related material flow and returns a reward value evaluating each action. Section 2.2 defines the reward function and how the agents interact with the RL environment, where the observed states and rewards compose the samples used to train the DDQN. Subsequently, Section 2.3 presents the training algorithm based on Hasselt et al. [50], which updates the agents' parameters (neural network weights). Figure 1 illustrates the workflow showing the interaction between the DES and the DDQN policy.

**Figure 1.** Workflow of the interaction between the DES and the DDQN method.

#### *2.1. Discrete Event Simulator*

The discrete event simulator presented in this work assumes a predefined sequence of extraction, the destination policy of each mining block and the shovel allocation. It also presumes that the shortest paths between shovels and destinations have been defined.

Figure 2 illustrates this relationship, where the black arrow is the predefined destination path for the block being extracted by the shovel. After the truck delivers the material to the dumping point (waste dump, processing plant or leaching pad, for example), a dispatching policy must define the next shovel assignment. The red arrows illustrate the path options for dispatching.


**Figure 2.** Representation of the possible paths a truck can follow: the pre-defined destination of each block (black arrow); the possible next dumping point (red arrows).

To simulate the operational interactions between shovels, trucks and dumping locations present in the mining complex, the DES considers the following major events:

Shovel Loading Event: The shovel loads the truck with an adequate number of loads. The total time required for this operation is stochastic, and once the truck is loaded, it leaves the shovel towards its destination, triggering the "Truck Moving Event." If the shovel must move to a new extraction point, it incurs a delay, representing the time taken to reposition the equipment. After the truck leaves the loading point, this event can trigger itself if there is another truck waiting in the queue.

Truck Moving Event: This event represents the truck going from a shovel to a dumping location, or vice versa. Each travelling time is sampled from a distribution approximated from historical data. Travelling empty or loaded impacts the truck speed, meaning that time values are sampled from different distributions in these situations. When the truck arrives at the loading point and the shovel is available, this event triggers a "Shovel Loading Event"; otherwise, the truck joins the queue of trucks. If the truck arrives at the dumping location, the event performs similarly: if the destination is empty, this event triggers a "Truck Dumping Event"; otherwise, the truck joins the queue of trucks.

Truck Dumping Event: This event represents the truck delivering the material to its destination, a waste dump or a processing plant, for example. The time to dump is stochastic, and after the event is resolved, a "Truck Moving Event" is triggered to send the truck back to be loaded. Here, a new decision can be made, sending the truck to a different shovel. Similar to the "Shovel Loading Event," once this event is finished, it can trigger itself if another truck is in the queue waiting for dumping.

Truck Breaking Event: Represents a truck stopping its activities due to maintenance or small failures. In this event, a truck is removed from the DES regardless of its current assignment. No action can be performed until it is fixed and returned to the operation.

Shovel Breaking Event: Represents the shovel becoming inaccessible for a certain period due to small failures or maintenance. No material is extracted during this period, and no trucks are sent to this location, being re-routed until the equipment is ready to be operational again.

Figure 3a shows a diagram illustrating a possible sequence of events that can be triggered. In the figure, the solid lines represent the events triggered immediately after the end of a particular event. The dashed lines are related to events that can be triggered if trucks are waiting in the queue. To ensure the sequence respects a chronological ordering, a priority queue is maintained, where each event is ranked by its starting time, as illustrated in Figure 3b.

**Figure 3.** Discrete event simulation represented in terms of: (**a**) an initial event and possible next events that can be triggered; (**b**) a priority queue that ranks each event by its starting time.

The DES starts with all the trucks positioned at their respective shovels. This configuration triggers a "Shovel Loading Event," and the DES simulates the subsequent events and how much material is moved by the trucks from the extraction points to their destinations. Once a truck dumps, a new decision is taken according to the DDQN policy. The DES proceeds by simulating the subsequent operations triggered by this assignment. This is repeated until the predefined time horizon, which represents $N_{days}$ of simulated activities, is reached by the DES. All events that occur between the beginning and the end of the DES constitute an episode. Subsequent episodes start by re-positioning the trucks at their initial shovel allocation.
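The event mechanism described above can be sketched as a minimal priority-queue loop. This is an illustrative sketch, not the paper's implementation: the class name, event names, and the lognormal duration distributions are assumptions standing in for the distributions fitted from historical data.

```python
import heapq
import itertools
import random

class DiscreteEventSimulator:
    """Minimal sketch of the priority-queue event loop: events are ranked by
    their starting time (Figure 3b) and each handler schedules follow-up events
    (Figure 3a). All names and distributions here are illustrative."""

    def __init__(self, horizon_minutes):
        self.horizon = horizon_minutes
        self.queue = []                     # heap of (start_time, tie_breaker, name, payload)
        self.counter = itertools.count()    # breaks ties between simultaneous events

    def schedule(self, start_time, event_name, payload):
        heapq.heappush(self.queue, (start_time, next(self.counter), event_name, payload))

    def run(self):
        # Pop events in chronological order until the episode horizon is reached.
        while self.queue:
            time, _, event_name, payload = heapq.heappop(self.queue)
            if time > self.horizon:
                break  # end of the episode (N_days of simulated activity)
            self.handle(time, event_name, payload)

    def handle(self, time, event_name, payload):
        # Each handler samples a stochastic duration and triggers the next event.
        if event_name == "shovel_loading":
            load_time = random.lognormvariate(1.0, 0.3)   # placeholder distribution
            self.schedule(time + load_time, "truck_moving", payload)
        elif event_name == "truck_moving":
            travel_time = random.lognormvariate(2.0, 0.4)
            self.schedule(time + travel_time, "truck_dumping", payload)
        elif event_name == "truck_dumping":
            dump_time = random.lognormvariate(0.5, 0.2)
            # after dumping, the dispatching policy would choose the next shovel here
            self.schedule(time + dump_time, "shovel_loading", payload)

sim = DiscreteEventSimulator(horizon_minutes=8 * 60)
sim.schedule(0.0, "shovel_loading", {"truck": 0, "shovel": 0})
sim.run()
```

In a full model, the queue-formation logic (a truck joining a queue when its shovel or dumping point is busy) and the breaking events would be added as further branches of the same handler.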

#### *2.2. Agent–Environment Interaction*

#### 2.2.1. Definitions

The framework considers $N_{trucks}$ trucks interacting with the DES. At every time step $t \in T$, after dumping the material into the adequate location, a new assignment for truck $i \in N_{trucks}$ is requested. The truck-agent $i$ observes the current state $S_t^i \in S$, where $S_t^i$ represents the perception of truck $i$ on how the mining complex is performing at step $t$, and takes an action $A_t^i \in A$, defining the next shovel to which the truck will be linked. The state $S_t^i$ is a vector encoding all attributes relevant to characterize the current status of the mining complex. Figure 4 illustrates these attributes describing the state space, such as current queue sizes, current GPS locations of trucks and shovels, and processing plant requirements.


This state information is encoded in a vector and inputted into the DDQN neural network, which outputs action-values, one for each shovel, representing the probability that the truck be dispatched to a shovel-dumping point path. A more detailed characterization of the state $S_t^i$ is given in Appendix A.


**Figure 4.** Illustration of the DDQN agent, which receives the state of the environment as input and outputs the desirability probability of choosing an action.
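The flattening of mining-complex attributes into the state vector $S_t^i$ might be sketched as below. The attribute names and shapes are illustrative assumptions; the paper's exact state characterization is given in its Appendix A.

```python
import numpy as np

def encode_state(queue_sizes, truck_positions, shovel_positions, plant_requirements):
    """Flatten mining-complex attributes into a fixed-length state vector.
    The attribute list here is an illustrative subset, not the full encoding."""
    return np.concatenate([
        np.asarray(queue_sizes, dtype=float),               # trucks waiting at each location
        np.asarray(truck_positions, dtype=float).ravel(),   # GPS coordinates of trucks
        np.asarray(shovel_positions, dtype=float).ravel(),  # GPS coordinates of shovels
        np.asarray(plant_requirements, dtype=float),        # remaining fraction of targets
    ])

state = encode_state(
    queue_sizes=[2, 0, 1],                                    # 3 locations
    truck_positions=[[0.1, 0.9], [0.4, 0.2]],                 # 2 trucks
    shovel_positions=[[0.0, 1.0], [0.5, 0.0], [0.9, 0.3]],    # 3 shovels
    plant_requirements=[0.75],                                # 1 processing plant
)
# state is a length-14 vector, ready to be fed to the DDQN network
```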

#### 2.2.2. Reward Function


Once the agent outputs the action $A_t^i$, the DES emulates how the mining complex environment evolves by simulating, for example, new cycle times and the formation of queues, taking into consideration all other trucks in operation. The environment then replies to this agent's decision with a reward function, represented by Equation (1):

$$\mathcal{R}_t^i = perc_t^i - pq_t^i \tag{1}$$

where $perc_t^i$ is the reward associated with delivering material to the mill and accomplishing a percentage of the destination's requirement (e.g., the mill's daily target in tons/day), and $pq_t^i$ is the penalty associated with spending time in queues at both shovels and destinations. This term guides solutions towards smaller queue formation while ensuring higher productivity.

In this multi-agent setting, each truck receives a reward $\mathcal{R}_t$, which is the sum of each truck's reward $\mathcal{R}_t^i$, as shown in Equation (2), to ensure that all agents aim to maximize the same reward function.

$$\mathcal{R}_t = \sum_{i=1}^{N_{trucks}} \mathcal{R}_t^i \tag{2}$$

During each episode, the agent performs $N_{steps}$ actions; the discounted sum of rewards is called the return, presented in Equation (3):

$$G_t = \mathcal{R}_{t+1} + \gamma \mathcal{R}_{t+2} + \gamma^2 \mathcal{R}_{t+3} + \dots + \gamma^{N_{steps}-t-1} \mathcal{R}_{N_{steps}} = \sum_{k=t+1}^{N_{steps}} \gamma^{k-t-1} \mathcal{R}_k \tag{3}$$

where $\gamma$ is a discounting factor parameter, which defines how much actions taken far in the future impact the objective function [15]. Equation (4) defines the objective, which is to obtain high-level control by training the agent to take improved actions so that the trucks can fulfil the production planning targets and minimize queue formation:

$$\max_{a \in A} \mathbb{E}\left[\, G_t \,\middle|\, S = S_t^i, A = A_t^i \,\right] \tag{4}$$
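Equations (1)–(3) can be sketched numerically as follows. The function names and the queue-penalty weight are illustrative assumptions; the paper does not specify the exact scaling between the production term and the queue term.

```python
def truck_reward(delivered_tons, daily_target_tons, queue_minutes,
                 queue_penalty_per_min=0.01):
    """Equation (1): production reward perc minus queue-time penalty pq.
    The penalty weight is an illustrative choice, not the paper's value."""
    perc = delivered_tons / daily_target_tons     # fraction of destination's target
    pq = queue_penalty_per_min * queue_minutes    # time spent in queues
    return perc - pq

def shared_reward(per_truck_rewards):
    """Equation (2): every agent maximizes the same summed reward."""
    return sum(per_truck_rewards)

def discounted_return(future_rewards, gamma=0.99):
    """Equation (3): G_t as the discounted sum of the rewards R_{t+1}..R_{N_steps}."""
    return sum(gamma ** k * r for k, r in enumerate(future_rewards))
```

For instance, a truck that delivers 500 of a 1000-ton target while spending 10 minutes in queues would receive `0.5 - 0.1 = 0.4` under these illustrative weights.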


The environment is characterized by uncertainties related to the loading, moving and dumping times of the equipment, and breakdowns of both trucks and shovels. This makes it very difficult to define all possible transition probabilities between states $p\left(S_{t+1}^i \,\middle|\, S = S_t^i, A = A_t^i\right)$ needed to obtain the expected value defined in Equation (4). Therefore, these transition probabilities are replaced by the Monte Carlo approach used in the form of the DES.

The framework allows future actions to be taken rapidly, since providing the input vector $S_t$ to the neural network and outputting the corresponding action is a fast operation. This means that the speed at which decisions can be made depends more on how quickly the attributes related to the state of the mining complex can be collected, which has recently improved substantially with the new sensors installed throughout the operation.

#### *2.3. Deep Double Q-Learning (DDQN)*

The approach used in the current study is the deep double Q-learning (DDQN) approach based on the work of Hasselt et al. [50]. The Q-function $Q_i\left(S_t^i, A_t^i, w_t^i\right)$ is the action-value function, shown in Equation (5), which outputs values representing the likelihood of truck $i$ choosing action $A_t^i$, given the encoded state $S_t^i$ and the set of neural network weights $w_t^i$, as illustrated in Figure 4.

$$Q_i\left(S_t^i, A_t^i, w_t^i\right) = \mathbb{E}\left[\, G_t \,\middle|\, S = S_t^i, A = A_t^i \,\right] \tag{5}$$

Denote $Q_i^*\left(s_t^i, a_t^i, w^i\right)$ to be the theoretical optimal action-value function. Equation (6) presents the optimal policy $\pi^*\left(S_t^i\right)$ for the state $S_t^i$, which is obtained by using the action-value function greedily:

$$\pi^*\left(S_t^i\right) = \underset{a' \in A}{\operatorname{argmax}}\, Q_i^*\left(S_t^i, a', w^i\right) \tag{6}$$

Note that, by using Equation (6), the approach directly maximizes the reward function described in Equation (4). This is accomplished by updating the $Q_i\left(S_t^i, A_t^i, w_t^i\right)$ function to approximate the optimal action-value function, $Q_i\left(S_t^i, A_t^i, w_t^i\right) \rightarrow Q_i^*\left(S_t^i, A_t^i, w^i\right)$.

By letting agent $i$ interact with the environment, given the state $S_t^i$, the agent chooses $A_t^i$ following the current dispatching policy $\pi_i\left(S_t^i\right) = \operatorname{argmax}_{a' \in A} Q_i\left(S_t^i, a', w_t^i\right)$; the environment then returns the reward $\mathcal{R}_t$ and a next state $S_{t+1}^i$. The sample experience $e_k^i = \left(S_t^i, A_t^i, \mathcal{R}_t, S_{t+1}^i\right)$ is stored in a memory buffer $D_K^i = \left\{e_1^i, e_2^i, \dots, e_K^i\right\}$, which grows as the agent interacts with the environment over additional episodes. A maximum size limits this buffer, and once it is reached, each new sample $e_k^i$ replaces the oldest one. This known strategy, called experience replay, helps stabilize the learning process [48,50,64].
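The experience replay buffer just described can be sketched in a few lines; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: once full, each new sample
    overwrites the oldest one, and training batches are drawn uniformly."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # deque drops the oldest entry when full

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive steps
        return random.sample(self.memory, batch_size)
```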

In the beginning, $Q_i(S_t^i, A_t^i, w_t^i)$ is randomly initialized; then a memory tuple $e_k^i = (S_t^i, A_t^i, R_t, S_{t+1}^i)$ is repeatedly and uniformly sampled from the memory buffer $D_K^i$, and its values are used to estimate the expected future return $\overline{G}_t$, as shown in Equation (7):

$$\overline{G}_t = \begin{cases} R_t, & \text{if the episode terminates at } t+1 \\ R_t + \gamma\, \overline{Q}_i\left(S_{t+1}^i,\ \underset{a' \in A}{\operatorname{argmax}}\ Q_i\left(S_{t+1}^i, a', w_t^i\right),\ \overline{w}^i\right), & \text{otherwise} \end{cases} \tag{7}$$

Additionally, gradient descent is performed on $\left(\overline{G}_t - Q_i(S_t^i, A_t^i, w_t^i)\right)^2$ with respect to the parameter weights $w_t^i$. Note that a different Q-function, $\overline{Q}_i(\cdot)$, is used to predict the future reward; this is simply $Q_i(\cdot)$ with the old weight parameters $\overline{w}^i$. This approach is also used to stabilize the agent's learning, as noisy environments can result in a slow learning process [50]. After $N_{Updt}$ steps, the weights $w_t^i$ are copied to $\overline{w}^i$, i.e., $\overline{Q}_i(\cdot) = Q_i(\cdot)$.
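The double-Q target of Equation (7) can be sketched as follows, assuming the action values of both the online and the (frozen) target network at $S_{t+1}^i$ are available as arrays; all names and the toy values are illustrative:

```python
import numpy as np

def ddqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Equation (7): the online network w_t selects the best next
    action, while the frozen target network w-bar evaluates it."""
    if done:                                # episode terminates at t+1
        return reward
    a_star = int(np.argmax(next_q_online))  # argmax_a' Q_i(S_{t+1}, a', w_t)
    return reward + gamma * next_q_target[a_star]

# Toy action values at S_{t+1} for both networks (three candidate shovels).
online = np.array([0.5, 2.0, 1.0])
target = np.array([0.4, 1.5, 3.0])
g = ddqn_target(reward=1.0, next_q_online=online, next_q_target=target,
                done=False)
# The online network picks action 1; the target network values it at 1.5,
# so g = 1.0 + 0.99 * 1.5 = 2.485
```

Decoupling action selection (online weights) from action evaluation (old weights) is what distinguishes DDQN from plain deep Q-learning and reduces the overestimation of action values.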

During training, agent *i* follows the greedy policy $\pi_i(S_t^i)$, meaning that it acts greedily with respect to its current knowledge. If gradient descent were performed with samples coming solely from $\pi_i(S_t^i)$, the method would quickly reach a local maximum. Thus, to avoid being trapped in a local maximum, $e\%$ of the time the agent takes random actions to explore the solution space, sampling them from a uniform distribution, $A_t^i \sim U(A)$. The remaining $(100 - e)\%$ of the time, the agent follows the current policy, $A_t^i \sim \pi_i(S_t^i)$. To progressively favor long-term gains, after every $N_{steps\_reduce}$ steps the exploration rate $e$ is reduced by a factor $reduce\_factor \in [0, 1]$. In summary, the algorithm is presented as follows:

**Algorithm 1** Proposed learning algorithm.

Initialize the action-value functions $Q_i(\cdot)$ and $\overline{Q}_i(\cdot)$ by assigning initial weights to $w_t^i$ and $\overline{w}^i$. Set $n1_{counter} = 0$ and $n2_{counter} = 0$. Initialize the DES with the trucks at their initial locations (e.g., queueing them at the shovel).

Repeat for each episode:

1. Given the current truck–shovel allocation, the DES simulates the supply material being transferred from the mining facies to the processors or waste dump by the trucks.
2. Once truck *i* dumps its material, a new allocation must be provided. At this point, the agent collects the information about the state $S_t^i$.
3. Sample $u \sim U(0, 100)$. If $u < e$: the truck-agent *i* acts randomly, $A_t^i \sim U(A)$; else, the truck-agent *i* acts greedily, $A_t^i \sim \pi_i(S_t^i)$.
4. Taking action $A_t^i$, observe $R_t$ and the new state $S_{t+1}^i$.
5. Store the tuple $e_k^i = (S_t^i, A_t^i, R_t, S_{t+1}^i)$ in the memory buffer $D_K^i$.
6. Sample a batch of experiences $e_k^i = (S_t^i, A_t^i, R_t, S_{t+1}^i)$, of size $batch\_size$, from $D_K^i$; for each transition sampled, calculate the respective $\overline{G}_t$ from Equation (7).
7. Perform gradient descent on $\left(Q_i\left(S_t^i, A_t^i, w_t^i\right) - \overline{G}_t\right)^2$ according to Equation (8):
$$w_{next}^i \leftarrow w_{old}^i - \alpha\left(Q_i\left(S_t^i, A_t^i, w_t^i\right) - \overline{G}_t\right)\nabla_{w^i}\, Q_i\left(S_t^i, A_t^i, w_t^i\right) \tag{8}$$
8. Set $n1_{counter} \leftarrow n1_{counter} + 1$ and $n2_{counter} \leftarrow n2_{counter} + 1$.
9. If $n1_{counter} \geq N_{Updt}$: $\overline{w}^i \leftarrow w_t^i$ and $n1_{counter} \leftarrow 0$.
10. If $n2_{counter} \geq N_{steps\_reduce}$: $e \leftarrow e \cdot reduce\_factor$ and $n2_{counter} \leftarrow 0$.
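The exploration and update steps of Algorithm 1 can be sketched as follows, using a linear action-value function in place of the neural network so that the gradient in Equation (8) is explicit; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_linear(w, state):
    """Stand-in for the Q-network: one weight row per action."""
    return w @ state

def select_action(w, state, epsilon, n_actions):
    """Exploration step of Algorithm 1: random with probability
    epsilon, greedy with probability (1 - epsilon)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))    # A_t ~ U(A)
    return int(np.argmax(q_linear(w, state)))  # A_t ~ pi_i(S_t)

def sgd_update(w, state, action, g_target, alpha):
    """Equation (8) for a linear Q: grad_w Q(s, a, w) is the state
    vector placed on the row of the chosen action."""
    td_error = q_linear(w, state)[action] - g_target
    w[action] -= alpha * td_error * state
    return w

w = rng.normal(size=(3, 4))   # 3 actions, 4 state features
s = rng.random(4)

# With epsilon = 0 the greedy choice of Equation (6) is recovered.
a = select_action(w, s, epsilon=0.0, n_actions=3)

# Repeated updates pull Q(s, a) toward the target G_t = 5.
before = abs(q_linear(w, s)[a] - 5.0)
for _ in range(200):
    w = sgd_update(w, s, a, g_target=5.0, alpha=0.1)
after = abs(q_linear(w, s)[a] - 5.0)

# Exploration-rate decay: multiply e by reduce_factor
# every N_steps_reduce steps (here 100 steps, factor 0.5).
epsilon, n_steps_reduce, reduce_factor = 1.0, 100, 0.5
for step in range(1, 301):
    if step % n_steps_reduce == 0:
        epsilon *= reduce_factor
```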

#### **3. Case Study at a Copper—Gold Mining Complex**

#### *3.1. Description and Implementation Aspects*

The proposed framework is implemented at a copper–gold mining complex, summarized in Figure 5. The mining complex comprises two open pits, whose supply material is extracted by four shovels and transported by twelve trucks to the appropriate destinations: waste dump, mill or leach pad. Table 1 presents information regarding the mining equipment and processors. The shovels are placed at the mining facies following pre-defined extraction sequences, with the destination of each block also established beforehand. The mining complex shares the truck fleet between pits A and B. The waste dump receives waste material from both mines, whereas the leach pad only processes supply material from pit B, due to mineralogical characteristics. A truck going to the leach pad dumps its material into a crusher, which then transports it to the leach pad. Regarding the milling material, each pit is associated with a crusher, and the trucks haul the high-grade material extracted from a pit to the corresponding crusher. Next, a conveyor belt transfers this material to the mill, combining the material from the two sources. Both

the mill and the leach pad are responsible for producing copper products and gold ounces to be sold.

*Minerals* **2021**, *11*, x FOR PEER REVIEW 9 of 17

**Figure 5.** Diagram of the mining complex.

**Table 1.** Mining complex equipment and processors.


Waste dump: 1 (no limitation on capacity).

The discrete event simulation, described in Section 2.1, emulates the loading, hauling and dumping operations in the mining complex. Each event is governed by uncertainties that impact the truck cycling times. Table 2 presents the distributions used for the related uncertainty characterization. For simplicity, these stochastic distributions are approximated from historical data; however, a more accurate approach would be to use the distributions derived directly from the historical data.

When a truck dumps material at a destination, a new dispatching decision must be taken by the DDQN dispatching policy; this generates the samples used to train the policy. During the training phase, each episode lasts the equivalent of 3 days of continuous production, where the truck-agent interacts with the discrete event mine simulator environment, taking actions and collecting rewards. In total, the computational time needed for training, for the present case study, is around 4 h. For the comparison (testing) phase, the method was exposed to five consecutive days of production. This acts as a validation step, ensuring that the agents observe configurations of the mining complex unseen during training. The results presented cover the five days of production, and the performance obtained shows that the method does not overfit to the three days of operation but maintains a consistent strategy over the additional days.

**Table 2.** Definition of stochastic variables considered in the mining complex.



Note that although the DDQN policy provides dispatching decisions in a context different from the one in which it was trained, the new situations cannot be totally different. It is assumed that, in new situations, the DDQN experiences are sampled from the same distribution observed during training. If the sequence of extraction changes considerably and new mining areas as well as other destinations are prioritized, the model needs to be retrained.

Two baselines are presented to compare the performance of the proposed approach. The first one, referred to as the fixed policy, continually dispatches each truck to the same shovel path throughout the episode; the performance comparison between the DDQN and the fixed policy is denoted Case 1. The second, referred to as the greedy policy, sends trucks to needy shovels with the shortest waiting times to decrease idle shovel time; this comparison is denoted Case 2. Both cases start with the same initial placement of the trucks.

The environment is stochastic, in the sense that testing the same policy for multiple episodes generates different results. Therefore, for the results presented here, episodes of 5 days of continuous production are repeated 10 times for each dispatching policy. To assess uncertainty beyond operational aspects, geological uncertainty is also included by considering 10 orebody simulations (Boucher and Dimitrakopoulos, 2009) characterizing the spatial uncertainty and variability of copper and gold grades in the mineral deposit. The graphs display the P10, P50 and P90 percentiles, corresponding to a probability of 10, 50 and 90%, respectively, of being below the value presented.

#### *3.2. Results and Comparisons*


| Stochastic variable | Distribution |
| --- | --- |
| Truck mean time to repair (h) | Poisson (5) |
| Shovel mean time between failures (h) | Poisson (42) |
| Shovel mean time to repair (h) | Poisson (4) |
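As a minimal sketch, the equipment-failure variables listed in Table 2 can be sampled with NumPy; only the rows shown above are included, and the seed and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_failure_times(rng, n=1):
    """Draw the equipment-uncertainty variables of Table 2 (hours)."""
    return {
        "truck_ttr_h": rng.poisson(5, n),    # truck mean time to repair
        "shovel_tbf_h": rng.poisson(42, n),  # shovel mean time between failures
        "shovel_ttr_h": rng.poisson(4, n),   # shovel mean time to repair
    }

draws = sample_failure_times(rng, n=10_000)
mean_tbf = draws["shovel_tbf_h"].mean()  # close to the Poisson rate of 42
```

Inside the DES, draws like these schedule failure and repair events, which is what makes the truck cycle times stochastic.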


Figure 6 presents the daily throughput obtained by running the DES over the five days of production, which is achieved by accumulating all material processed by the mill within each day. Note that here the P10, P50 and P90 are due only to the equipment uncertainty. Overall, the proposed model delivers more material to the mill compared to both cases. The DDQN method adapts the dispatching to move trucks around, relocating them to the shovels most in need, which consistently results in higher throughput.

**Figure 6.** Daily throughput at the mill, comparing the DDQN policy (black line) and the respective baselines (blue line): (**a**) fixed policy and (**b**) greedy policy.

The throughput on day five drops compared to previous days, mostly due to a smaller availability of trucks, as the DES considers truck failures; Figure 7 presents the average number of trucks available per day. During the initial three days, the availability hovers between 10 and 12 trucks, but this rate drops in the last 2 days, which decreases production. However, the trained policy can still provide a higher feed rate at the mill, even in this adversity. The availability of trucks on days 4 and 5 is smaller than in the period for which the DDQN-based method was trained, which shows the adapting capability of the dispatching approach.


**Figure 7.** Availability of trucks during the five days of operation.

The framework is also efficient in avoiding queue formation. Figure 8 presents the average queue sizes at the mill and the sulphide leach. The queue at different locations is recorded hourly and averaged over each day. The plot shows that, for most of the days, the proposed approach generates smaller queues. Combined with the higher throughput obtained, this reduction in queue sizes demonstrates better fleet management. For example, during the initial three days, the DDQN approach improves the dispatching strategy by forming smaller queues at the mill, while the amount of material delivered is consistently higher. On the 4th day, the proposed approach generates a larger queue at the mill, which is compensated by a considerably higher throughput at this location.

**Figure 8.** Queue sizes of trucks waiting at the mill (top) and sulphide leach (bottom) for the DDQN policy (black line) and the respective baseline (blue line): (**a**) fixed policy and (**b**) greedy policy.


Figure 9 displays the cumulative total copper recovered at the mining complex over the five days. Interestingly, during the first three days of DES simulation, corresponding to the training period of the DDQN approach, the total recovered copper profiles of the proposed method and the baselines are similar. However, the difference is more pronounced over the last two days, which represent situations the trained method has not seen. This results in 16% more copper recovered than the fixed policy and 12% more than the greedy strategy. The difference is even larger when the total gold recovered is compared: the DDQN method generates a 20 and 23% higher gold profile in Case 1 and Case 2, respectively (Figure 10).


**Figure 9.** Cumulative copper recovered for the optimized DDQN policy (black line) and the respective baseline (blue line): (**a**) Case 1 and (**b**) Case 2.

**Figure 10.** Cumulative gold recovered for the DDQN policy (black line) and the respective baseline (blue line): (**a**) fixed policy and (**b**) greedy policy.

#### **4. Conclusions**


This paper presents a new multi-agent truck-dispatching framework based on reinforcement learning. The approach involves the interaction between a DES, simulating the operational events in a mining complex, and a truck-dispatching policy
based on the DDQN method. Given a pre-defined schedule, in terms of the sequence of extraction and destination policies for the mining blocks, the method improves real-time truck-dispatching performance. The DES mimics daily operations, including loading, transportation, dumping and equipment failures. A truck delivers its material to a processor or waste dump, and the truck-dispatcher provides it with a new shovel path. At this point, the truck receives information about the mining complex, such as the locations of other trucks via GPS tracking, the amount of material feeding the processing plants and the queue sizes at different locations. This information is encoded into a vector characterizing the state of the mining complex, which is input into the DDQN neural network; the network outputs action values, describing the likelihood of sending the truck to each shovel. Each dispatching decision yields a reward, which the agent receives as a performance evaluation. Initially, the truck-agent acts randomly; as the agent experiences many situations during training, the dispatching policy improves. Thus, when new dispatching decisions are requested, an assignment is quickly obtained from the output of the DDQN agent. This differs from previous methods, which repeatedly solve an optimization model during dispatching; instead, the only requirement is to collect information regarding the state of the mining complex. With the digitalization of mines, obtaining the required information can be done quickly.

The method is applied to a copper–gold mining complex composed of two pits, three crushers, one waste dump, one mill and one leach-pad processing stream. The fleet is composed of four shovels and twelve trucks that can travel between the two pits. The DDQN-based method is trained for the equivalent of three days, while the results are presented for five days of production. Two dispatching baseline policies are used for comparison to assess the capabilities of the proposed method: fixed truck-shovel allocations and a greedy approach that dispatches trucks to needy shovels with the smallest queue sizes. The results show that the DDQN-based method provides the mill processing stream with higher throughput while generating shorter queues at the different destinations, demonstrating better fleet utilization. Over the five days of production, the proposed policy produces 12 to 16% more copper and 20 to 23% more gold than the baseline policies. Overall, the reinforcement learning approach has been shown to be effective in training truck-dispatching agents, improving real-time decision-making. However, future work needs to explore the development of new approaches that address the impact and adaptation of truck-dispatching decisions to changes and re-optimization of short-term extraction sequences, given the acquisition of new information in real time and uncertainty in the properties of the materials mined.

**Author Contributions:** Conceptualization, J.P.d.C. and R.D.; methodology, J.P.d.C. and R.D.; software, J.P.d.C.; validation, J.P.d.C. and R.D; formal analysis, J.P.d.C. and R.D; investigation, J.P.d.C.; resources, R.D.; data curation, J.P.d.C.; writing—original draft preparation, J.P.d.C. and R.D.; writing review and editing, R.D.; visualization, J.P.d.C.; supervision, R.D.; project administration, R.D.; funding acquisition, R.D. Both authors have read and agreed to the published version of the manuscript.

**Funding:** This work is funded by the National Science and Engineering Research Council of Canada (NSERC) CRD Grant CRDPJ 500414-16, NSERC Discovery Grant 239019, and the COSMO mining industry consortium (AngloGold Ashanti, AngloAmerican, BHP, De Beers, IAMGOLD, Kinross, Newmont Mining and Vale).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Appendix A.**

## *Appendix A.1. State Definition*

The state-of-the-mining-complex vector $S_t^i$ encodes all attributes relevant to characterize the current status of the mining complex. Table A1 presents where the attributes are taken from and how they are represented in vector format. Note that the encoding used here simply transforms continuous attributes into values between 0 and 1, by dividing them by a large number. For discrete attributes, a one-hot-encoding approach is used, where the number of categories defines the size of the vector and a value of 1 is placed at the location corresponding to the actual category. This strategy helps avoid generating large gradients during gradient descent and facilitates the learning process. The idea can be further generalized, and other attributes judged relevant by the user can also be included.
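A minimal sketch of this encoding follows; the attribute names, the number of locations and the normalization constants are illustrative, not the values of Table A1:

```python
import numpy as np

def encode_state(truck_location, n_locations, mill_feed_tph, queue_size):
    """Encode mixed attributes into one vector with entries in [0, 1]:
    one-hot encoding for the discrete location, division by a large
    constant for the continuous attributes."""
    one_hot = np.zeros(n_locations)
    one_hot[truck_location] = 1.0            # 1 at the actual category
    continuous = np.array([
        mill_feed_tph / 10_000.0,            # divide by an upper-bound feed rate
        queue_size / 20.0,                   # divide by a maximum queue length
    ])
    return np.concatenate([one_hot, continuous])

s = encode_state(truck_location=2, n_locations=5,
                 mill_feed_tph=4500, queue_size=3)
```

Keeping every entry in [0, 1] is what prevents any single attribute from dominating the gradients during training.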

**Table A1.** Attributes defining the current state of the mining complex.


*Appendix A.2. Neural Network Parameters*

**Table A2.** Reinforcement learning parameters.

