Article

Intelligent Scheduling in Open-Pit Mining: A Multi-Agent System with Reinforcement Learning

by Gabriel Icarte-Ahumada 1,* and Otthein Herzog 2,3
1 Faculty of Engineering and Architecture, Arturo Prat University, Iquique 1110939, Chile
2 Department of Mathematics/Informatics, University of Bremen, 28359 Bremen, Germany
3 College of Architecture and Urban Planning, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
Machines 2025, 13(5), 350; https://doi.org/10.3390/machines13050350
Submission received: 20 March 2025 / Revised: 22 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025
(This article belongs to the Special Issue Key Technologies in Intelligent Mining Equipment)

Abstract
An important process in the mining industry is material handling, where trucks are responsible for transporting materials extracted by shovels to different locations within the mine. The decision about the destination of a truck is very important to ensure an efficient material handling operation. Currently, this decision-making process is managed by centralized systems that apply dispatching criteria. However, this approach has the disadvantage of not providing accurate dispatching solutions due to the lack of awareness of potentially changing external conditions and the reliance on a central node. To address this issue, we previously developed a multi-agent system for truck dispatching (MAS-TD), where intelligent agents representing real-world equipment collaborate to generate schedules. Recently, we extended the MAS-TD (now MAS-TDRL) by incorporating learning capabilities and compared its performance with the original MAS-TD, which lacks learning capabilities. This comparison was made using simulated scenarios based on actual data from a Chilean open-pit mine. The results show that the MAS-TDRL generates more efficient schedules.

1. Introduction

In an open-pit mine, the material excavated by shovels must be transported by trucks to various locations within the mine. If a shovel extracts ore, a truck is required to transport the material and deposit it either in a crusher or onto a stockpile. If the extracted material is waste, it must be transported to a designated waste dump. This process is known as material handling and plays a critical role in open-pit mining, as it can account for up to 50% of the total operational costs [1]. Figure 1 illustrates the truck’s operational cycle, covering its journey from the loading point (shovel) to the unloading destination (crusher, stockpile, or waste dump). Trucks continuously repeat this cycle throughout the duration of a shift.
Open-pit mines function as closed systems where operations are affected by a dynamic environment. Factors such as equipment failures, changing weather conditions, or variations in road conditions directly impact equipment efficiency and availability, resulting in delays in material handling [3]. Given this variability and the inherent stochastic nature of the process, determining the next destination for a truck becomes a complex challenge.
Currently, centralized systems use operations research techniques, heuristic methods, or simulation models to support the material handling process [4]. Most of these systems employ a multi-stage approach [1], in which an initial stage establishes a guiding framework. Subsequently, this framework, together with a dispatching criterion (e.g., real-time shovel production levels), is used to determine truck assignments as needed.
Despite the implementation of such solutions, material handling in open-pit mines remains inefficient. Situations frequently arise where trucks are queued at shovels or crushers, while other shovels remain idle waiting for trucks. These inefficiencies increase costs and hinder the fulfillment of production targets. Researchers have indicated that existing systems are unable to fully address these challenges because they do not accurately model equipment activities [5] and often rely on estimated data [6,7,8,9].
Icarte et al. [2] propose an alternative approach to optimize the coordination of mining equipment operations. This solution is based on a multi-agent system (MAS), in which intelligent agents represent physical mining equipment. Through agent interactions, individualized schedules are generated to improve overall system efficiency.
This paper explores the application of reinforcement learning in a multi-agent system for truck-shovel scheduling. The proposed approach was tested in a simulated open-pit mining environment, comparing scheduling efficiency with and without learning capabilities.
The remainder of this paper is organized as follows: Section 2 reviews related work in truck dispatching and scheduling. Section 3 presents the problem formulation and methodology. Section 4 discusses the results obtained, and Section 5 provides the conclusion and future research directions.

2. Related Works

2.1. Scheduling in Open-Pit Mining

Most of the reports on the truck dispatching problem in open-pit mines follow an allocation model in which the destination of a truck is determined when it is required. Only a few publications have modeled the problem as a scheduling problem. For instance, Chang et al. [7] and Patterson et al. [5] propose algorithms that generate an initial schedule that is then improved using a metaheuristic method. Their results show that the algorithms produce good schedules for instances of different sizes within time frames that are practical for the mining industry. Icarte et al. [2] present a multi-agent system for truck dispatching (MAS-TD) with agents representing trucks and shovels. The agents interact with each other to generate schedules. Their results show that the MAS-TD provides schedules in practical time frames and can handle environment dynamics. Zhang et al. [10] propose a mixed-integer programming model and a tabu search algorithm to optimize autonomous truck trips and demonstrate their effectiveness in a real-time scheduling system tested in a coal mine in Inner Mongolia, China.

2.2. Reinforcement Learning in Mining

In recent years, optimization in mining operations scheduling has advanced significantly with the integration of machine learning techniques, particularly those based on reinforcement learning and deep reinforcement learning. However, only a few studies have explored the use of these techniques to improve the allocation of trucks to shovels in open-pit mines, addressing challenges related to operational efficiency, cost reduction, and environmental impact mitigation. Some of these studies are briefly summarized below.
One of the predominant approaches has been the application of Q-learning to fleet dispatching in open-pit mines with the aim of reducing greenhouse gas emissions by optimizing fuel consumption. According to Huo et al. [11], the implementation of this strategy reduced emissions by more than 30% compared to conventional fixed scheduling methods. In parallel, stochastic optimization of mining complexes has been addressed by combining simulated annealing and reinforcement learning. In this context, adaptive heuristic selection has led to execution time reductions of up to 80% compared to conventional methods, as reported by Yaakoubi and Dimitrakopoulos [12].
The use of multi-agent deep reinforcement learning techniques has demonstrated significant progress in solving the problem of dynamic truck dispatching in large mines with heterogeneous fleets. Zhang et al. [13] developed a Deep Q-Network (DQN) model trained on a simulator calibrated with real mining data, achieving a 5.56% increase in productivity compared to widely used industrial approaches. Similarly, shovel allocation and short-term production planning have been addressed using Deep Q-Learning, integrating discrete event simulations to model uncertainties in cycle times and operational failures. Noriega and Pourrahimian [14] demonstrate that adopting these methods enables the generation of more robust allocation policies that ensure that production targets are met even in dynamic environments.
Another approach is the integration of reinforcement learning with mining fleet management and short-term production planning updates. De Carvalho and Dimitrakopoulos [15] implemented a framework based on Actor-Critic Reinforcement Learning, enabling real-time data incorporation from sensors to continuously update uncertainty estimates of the extracted ore quality. This strategy led to a 47% improvement in cash flow by adaptively optimizing the allocation of trucks to shovels based on the most recent available data.
Finally, the optimization of ore blending scheduling in open-pit mines has been modeled as a Partially Observable Markov Decision Process (POMDP) and addressed using multi-agent deep deterministic policy gradient (MADDPG). Feng et al. [16] applied this technique to improve accuracy and computational speed, achieving significantly higher efficiency compared to traditional multi-objective optimization algorithms.
The reviewed studies highlight the positive impact of advanced learning techniques in mining scheduling optimization, leading to significant improvements in operational efficiency, cost reduction, and real-time decision-making. While the adoption of these methods still faces challenges in terms of scalability and generalization across different mining environments, the results obtained so far underscore their potential to transform planning and management processes in the mining industry.

3. Methodology

The methodology used in this study was structured around three key stages: defining the problem, designing and implementing the multi-agent system with reinforcement learning capabilities, and evaluating its performance. The following subsections provide a detailed description of this methodology.

3.1. Problem Statement

In our model, trucks can be allocated to any shovel, while each shovel is assigned to a specific pit. Throughout the shift, the material extracted by a shovel must be transported to a designated destination. Each shovel can load only one truck at a time. At a crusher, unloading is limited to a single truck at a time, whereas at waste dumps and stockpiles, multiple trucks can unload simultaneously. Table 1 shows the notation for the sets, indexes, parameters, and decision variables used in the model.
The objective function (1) seeks to minimize the total operational cost, measured in terms of time, required for shovels and trucks to complete their tasks. Constraint (2) ensures that each time slot l on a shovel is assigned to at most one truck, while constraint (3) guarantees that no more than one truck is loaded by a shovel at any given time. Similarly, constraint (4) establishes that crushers can accommodate only one truck for unloading at a time.
Constraint (5) dictates that the unloading process, represented by μ_{s,l}, can only begin after loading has started at λ_{s,l}, accounting for both the loading duration C_s and the travel time to the destination C_{js}. Constraint (6) ensures that a truck’s next loading operation at l′ can only occur after it has completed unloading at l. Furthermore, constraints (7) and (8) enforce a sequential order for each truck, ensuring that each truck follows a defined path with exactly one predecessor and one successor.
Constraint (9) enforces that the planned loading targets for each shovel are met as outlined in the production plan. Constraint (10) considers the initial travel time of a truck to its first loading point at the beginning of the shift. Additionally, constraint (11) specifies that all trucks must initiate operations from a designated dummy pit, which represents their initial location, while constraint (12) ensures that trucks also conclude their shift at a dummy pit. These constraints define a structured scheduling framework that maintains logical sequencing, operational feasibility, and efficiency in material handling operations.
\min \sum_{s,l} \left( C_{s'j} + C_s + C_{js} + C_j \right)  (1)
\sum_{r} X_{r,s,l} \le 1 \quad \forall\, l, s  (2)
\lambda_{s,l+1} - \lambda_{s,l} \ge C_s \quad \forall\, l, s  (3)
\mu_{s,l+1} - \mu_{s,l} \ge C_j \quad \forall\, s, l  (4)
\mu_{s,l} \ge \lambda_{s,l} + C_s + C_{js} \quad \forall\, l, s  (5)
\lambda_{s',l'} \ge \mu_{s,l} + C_{s'j} + C_j - M\left(2 - \lambda^{seq}_{r,s,l,s',l'} - X_{r,s,l}\right) \quad \forall\, r, l, s, s', l'  (6)
X_{r,s,l} = \sum_{s',l'} \lambda^{seq}_{r,s,l,s',l'} \quad \forall\, r, l, s  (7)
X_{r,s,l} = \sum_{s',l'} \lambda^{seq}_{r,s',l',s,l} \quad \forall\, r, l, s  (8)
\sum_{r,l} X_{r,s,l}\, A_r \ge \delta_s \quad \forall\, s  (9)
\lambda_{s,1} \ge C_{sr} - M\left(1 - \lambda^{seq}_{r,0,0,s,1}\right) \quad \forall\, r, s  (10)
\sum_{s,l} \lambda^{seq}_{r,0,0,s,l} = 1 \quad \forall\, r  (11)
\sum_{s,l} \lambda^{seq}_{r,s,l,0,0} = 1 \quad \forall\, r  (12)

3.2. The MAS-TDRL

The problem is modeled as a multi-agent system in which the key entities of the material handling in open-pit mines—shovels, trucks, and unloading points—are represented as autonomous agents. Each agent makes decisions based on local observations, interactions with other agents, and learned policies. The following subsections describe the agents, the coordination mechanism, the agent decision-making, and the learning algorithm.

3.2.1. Agents

The MAS-TDRL is designed to achieve the targets of the production plan while minimizing operational costs. This approach provides a more realistic and adaptive alternative to centralized truck dispatching systems. The MAS-TDRL consists of intelligent agents that represent key mining equipment, each with a specific role in optimizing material handling operations.
A truckAgent represents an individual truck within the mining environment. Its primary function is to generate an optimized schedule for its operations while minimizing costs. To achieve this, it considers key parameters such as capacity, loaded and empty velocity, spotting time, and unloading duration. Additionally, the truckAgent uses the mine layout for route planning and participates in negotiation processes to determine optimal assignments.
A shovelAgent models a real-world shovel and focuses on scheduling its operations in alignment with the targets of the production plan. It considers factors such as capacity, digging and loading velocities, and the designated destination for the extracted material.
An unloadingPointAgent represents facilities where material is deposited, including crushers, stockpiles, and waste dumps. Its role is to manage the scheduling of unloading operations to ensure an efficient flow of materials. The primary parameter is the number of trucks that can unload simultaneously at a given location.
By leveraging these specialized agents and their interactions, the MAS-TDRL enhances equipment coordination, optimizes resource allocation, and improves overall material handling efficiency.
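For illustration, the parameters that each agent type reasons about can be summarized as plain data classes. The sketch below uses hypothetical field names; it is not the published MAS-TDRL source code.

```java
// Illustrative only: hypothetical field names summarizing the parameters
// each agent type reasons about (not the published MAS-TDRL classes).
class TruckProperties {
    double capacityTons;       // payload capacity
    double loadedVelocityKmH;  // velocity when loaded
    double emptyVelocityKmH;   // velocity when empty
    int spottingTimeSec;       // time to position at the shovel
    int unloadingTimeSec;      // time to unload at the destination
}

class ShovelProperties {
    double bucketCapacityTons;
    int loadTimeSec;
    int digTimeSec;
    String destinationId;      // designated unloading point for the extracted material
}

class UnloadingPointProperties {
    int simultaneousTrucks;    // 1 for a crusher, more for stockpiles and waste dumps
}
```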

3.2.2. Coordination Mechanism

To generate schedules, the agents negotiate using an enhanced version of the Contract Net Protocol [17], as described in [2]. In the enhanced protocol, a shovelAgent acts as the initiator and the truckAgents act as participants. The shovelAgents initiate the negotiation process in parallel, which requires the protocol to handle concurrent negotiations.
To support this, the standard Contract Net Protocol (CNP) is extended by incorporating a confirmation phase. The process begins when each shovelAgent sends a call-for-proposal (CFP) message to all truckAgents, indicating the time slot during which it will be available for loading. Upon receiving the CFP, a truckAgent responds with a propose message that includes its estimated time of arrival at the shovel and the associated cost of performing the required operation. If the operation cannot be accommodated in the truck’s schedule, the truckAgent responds with a refuse message.
While waiting for responses, each shovelAgent collects and stores incoming proposals, keeping track of the number of messages received from truckAgents. The shovelAgent continues collecting proposals either until a predefined deadline expires or until it has received responses (propose or refuse messages) from all truckAgents. Two outcomes are then possible: if no proposals are received, the negotiation process ends without a contract; otherwise, the shovelAgent selects the best proposal and sends a message with the performative requestConfirmation to the corresponding truckAgent.
At this point, three scenarios can occur: the shovelAgent receives an acceptConfirmation, a refuseConfirmation, or no response within the deadline. In the first case, the shovelAgent sends rejectProposal messages to all unselected truckAgents, and the negotiation concludes with a contract. In the second and third cases, the shovelAgent removes the unconfirmed proposal from its storage, selects the next best option, and repeats the confirmation cycle. If there are no proposals left, the negotiation process ends without a contract.
The enhanced Contract Net Protocol allows agents to conduct multiple negotiations simultaneously, ensuring better coordination and scheduling efficiency. Figure 2 illustrates the workflow of the protocol, while Table 2 presents an example of a truck schedule generated through this negotiation mechanism.
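The confirmation phase can be read as a selection loop run by the shovelAgent once proposal collection ends. The following plain-Java sketch illustrates that loop; the type and method names (Proposal, Messenger, requestConfirmation) are illustrative and independent of the JADE messaging layer actually used, and the ordering of proposals is simplified to cost.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Plain-Java sketch of the shovelAgent's confirmation phase (illustrative names).
class ConfirmationPhase {

    static class Proposal {
        final String truckId;
        final double cost;
        Proposal(String truckId, double cost) { this.truckId = truckId; this.cost = cost; }
    }

    enum Reply { ACCEPT_CONFIRMATION, REFUSE_CONFIRMATION, TIMEOUT }

    interface Messenger {
        Reply requestConfirmation(String truckId);  // blocks until a reply arrives or the deadline expires
        void rejectProposal(String truckId);
    }

    /** Returns the contracted truck id, or null if the negotiation ends without a contract. */
    static String selectAndConfirm(List<Proposal> proposals, Messenger messenger) {
        // Order by the shovelAgent's preference (simplified here to lowest cost first).
        proposals.sort(Comparator.comparingDouble(p -> p.cost));
        Deque<Proposal> pending = new ArrayDeque<>(proposals);

        while (!pending.isEmpty()) {
            Proposal best = pending.pollFirst();
            Reply reply = messenger.requestConfirmation(best.truckId);
            if (reply == Reply.ACCEPT_CONFIRMATION) {
                // Contract reached: send rejectProposal to all unselected trucks.
                for (Proposal p : pending) messenger.rejectProposal(p.truckId);
                return best.truckId;
            }
            // refuseConfirmation or no answer: discard and try the next best proposal.
        }
        return null;
    }
}
```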

3.2.3. Agent Decision-Making

ShovelAgents must evaluate the received proposals and select the most suitable one. This decision is guided by the utility function proposed in [18], which prioritizes proposals that reduce shovel idle time while minimizing the total cost of truck operations.
TruckAgents, on the other hand, must make two decisions. The first is whether to submit a proposal in response to a call-for-proposal message from a shovelAgent. To determine this, a truckAgent checks its schedule to see if there is an available time slot that matches the offered loading time. If a slot is available, it calculates the total time required to complete the operations and assesses whether the task can be accommodated within its schedule. If the request cannot be fulfilled, the truckAgent sends a refuse message; otherwise, it submits a proposal.
The second decision is to confirm or withdraw a previously submitted proposal. In this case, the truckAgent evaluates the idle time of the shovel, as indicated in the requestConfirmation message sent by the shovelAgent, as well as its ongoing negotiations. If the idle time of the shovel is at least one minute, the truckAgent sends an acceptConfirmation message to proceed with the assignment. However, if the idle time is less than one minute, the truckAgent checks whether it is involved in another negotiation that offers a more favorable outcome (i.e., lower operational cost). If a better option is available, it sends a refuseConfirmation message; otherwise, it confirms the initial proposal by sending an acceptConfirmation message.
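The two truckAgent decisions can be expressed as small predicate methods. The sketch below is illustrative only and assumes a hypothetical Schedule interface; the one-minute threshold and the cost comparison follow the description above.

```java
// Plain-Java sketch of the two truckAgent decisions described above.
// Method and type names are illustrative, not the published MAS-TDRL source.
class TruckDecisions {

    /** Decision 1: answer a call-for-proposal with propose or refuse. */
    static boolean shouldPropose(Schedule schedule, long offeredLoadingStart, long requiredDuration) {
        // Propose only if a free slot in the truck's schedule covers the whole operation.
        return schedule.hasFreeSlot(offeredLoadingStart, requiredDuration);
    }

    /** Decision 2: confirm or withdraw a previously submitted proposal. */
    static boolean shouldConfirm(double shovelIdleMinutes,
                                 double thisProposalCost,
                                 double bestCostInOtherNegotiations) {
        if (shovelIdleMinutes >= 1.0) {
            return true;                        // shovel idle time of at least one minute: confirm
        }
        // Idle time below one minute: confirm only if no better negotiation is pending.
        return thisProposalCost <= bestCostInOtherNegotiations;
    }

    interface Schedule {
        boolean hasFreeSlot(long startTime, long duration);
    }
}
```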

3.2.4. Reinforcement Learning

The truckAgent in the MAS-TDRL uses Q-learning as a reinforcement learning technique to improve its decision-making in response to call-for-proposal messages from shovelAgents. The objective is to determine whether to send a propose or refuse message based on learned experiences. Q-learning was selected over more complex reinforcement learning algorithms such as Deep Q-Networks (DQN) due to its low computational complexity and ease of integration in a distributed, decentralized environment where each agent learns independently.
In the implemented Q-learning approach, the state representation is defined only by the shovel identifier. This simplification was intentionally chosen to maintain a compact state space, ensure convergence within a reasonable number of episodes, and preserve the interpretability of the learned behavior. Furthermore, it allows truckAgents to learn which shovelAgents are more likely to reject their proposals and to avoid unproductive negotiations, reducing message traffic and decision time. Each shovelAgent represents a distinct state, and the truckAgent maintains a Q-table that maps each shovel to its corresponding Q-values. The Q-table consists of two values for each shovel: one for the action of refusing to send a proposal and another for the action of sending a proposal. This formulation ensures that each truckAgent learns a specific policy for each shovel it interacts with. Mathematically, the Q-values are represented as follows:
Q(s, a) = \text{expected reward for taking action } a \text{ in state } s
where s represents the shovel identifier, and a represents the possible actions, either propose (1) or refuse (0). When a truckAgent receives a CFP, it retrieves the Q-values associated with the sending shovelAgent and applies an epsilon-greedy policy to determine the next action:
a = \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{random action}, & \text{with probability } \varepsilon \end{cases}
where ε = 0.05 represents the exploration rate, ensuring that the truckAgent occasionally tries different actions to avoid local optima. The value of ε was selected after several empirical trials to ensure a balanced trade-off between exploration and exploitation in the simplified state space defined by shovel identifiers.
The reward function directly influences how the truckAgent updates its Q-values. The implemented reward mechanism considers two primary factors: successful negotiations and proposal rejections. This was intentionally designed to encourage successful negotiations and discourage repeated rejections, allowing truckAgents to adapt their behavior to avoid unproductive interactions. The reward function is defined as follows:
r(s, a) = \begin{cases} +5.0, & \text{if the proposal is accepted} \\ -0.1 \times (n_{\text{rejections}} + 1), & \text{if the proposal is rejected} \end{cases}
where n_rejections represents the number of times the truckAgent’s proposals have been rejected by the same shovelAgent. This approach penalizes repeated rejections and discourages truckAgents from persistently making unsuccessful proposals. The Q-values are updated using the standard Q-learning update rule [18]:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
where
  • α = 0.01 is the learning rate, which controls how much new experience influences the existing knowledge;
  • γ = 0.98 is the discount factor, which prioritizes long-term rewards over immediate gains;
  • r is the reward received from the environment;
  • max_{a′} Q(s′, a′) is the highest expected future reward over the possible next actions.
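Putting the pieces together, the per-shovel Q-table, the ε-greedy policy, the reward, and the update rule can be sketched as follows. This is a minimal illustration using the hyperparameters reported above (α = 0.01, γ = 0.98, ε = 0.05); class and method names are hypothetical, and treating the same shovel as the next state is an assumption made for the sketch, since each shovel is an independent state in this formulation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Minimal sketch of the truckAgent's Q-learning component (illustrative names,
// not the published MAS-TDRL source).
class TruckQLearning {
    static final int REFUSE = 0, PROPOSE = 1;

    private final double alpha = 0.01, gamma = 0.98, epsilon = 0.05;
    private final Map<String, double[]> qTable = new HashMap<>();     // state = shovel identifier
    private final Map<String, Integer> rejections = new HashMap<>();  // per-shovel rejection count
    private final Random rng = new Random();

    /** Epsilon-greedy action selection when a CFP arrives from the given shovel. */
    int chooseAction(String shovelId) {
        double[] q = qTable.computeIfAbsent(shovelId, s -> new double[2]);
        if (rng.nextDouble() < epsilon) return rng.nextInt(2);        // explore
        return q[PROPOSE] >= q[REFUSE] ? PROPOSE : REFUSE;            // exploit
    }

    /** Reward as defined above: +5.0 on acceptance, growing penalty on rejection. */
    double reward(String shovelId, boolean proposalAccepted) {
        if (proposalAccepted) return 5.0;
        int n = rejections.merge(shovelId, 1, Integer::sum);          // previous rejections + 1
        return -0.1 * n;
    }

    /** Standard Q-learning update; the next state is taken to be the same shovel in this sketch. */
    void update(String shovelId, int action, double reward) {
        double[] q = qTable.computeIfAbsent(shovelId, s -> new double[2]);
        double maxNext = Math.max(q[REFUSE], q[PROPOSE]);
        q[action] += alpha * (reward + gamma * maxNext - q[action]);
    }
}
```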

3.2.5. MAS-TDRL Implementation

The MAS-TDRL was implemented using the Java Agent DEvelopment Framework (JADE 4.6.0) [19], a robust platform specifically designed for developing multi-agent systems. JADE provides a comprehensive set of tools, including predefined interaction protocols, agent behaviors, and graphical utilities for monitoring agent interactions and evaluating system performance. Built on Java 1.8, JADE ensures cross-platform compatibility across different operating systems. All experiments were conducted on a MacBook Pro (Apple Inc., Cupertino, CA, USA) powered by an Apple M2 Pro chip and equipped with 16 GB of RAM.
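For readers unfamiliar with JADE, the skeleton below sketches how a truckAgent could register a cyclic behaviour that answers CFP messages with propose or refuse. It uses only standard JADE classes (Agent, CyclicBehaviour, ACLMessage, MessageTemplate); the decision logic and message content encoding are placeholders, not the actual MAS-TDRL implementation.

```java
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;
import jade.lang.acl.MessageTemplate;

// Minimal JADE skeleton of a truckAgent reacting to call-for-proposal messages.
// Only the message plumbing is shown; scheduling and learning logic are omitted.
public class TruckAgent extends Agent {

    @Override
    protected void setup() {
        final MessageTemplate cfpTemplate = MessageTemplate.MatchPerformative(ACLMessage.CFP);

        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage cfp = myAgent.receive(cfpTemplate);
                if (cfp == null) { block(); return; }      // wait for the next message

                ACLMessage reply = cfp.createReply();
                if (decideToPropose(cfp)) {
                    reply.setPerformative(ACLMessage.PROPOSE);
                    reply.setContent("eta-and-cost");      // placeholder content encoding
                } else {
                    reply.setPerformative(ACLMessage.REFUSE);
                }
                myAgent.send(reply);
            }
        });
    }

    /** Placeholder for the schedule check and Q-learning policy described in Section 3.2. */
    private boolean decideToPropose(ACLMessage cfp) {
        return true;
    }
}
```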

3.3. Simulation and Evaluation Metrics

For the evaluation, three simulated scenarios were created based on real data from an open-pit copper mine in Chile. These scenarios feature heterogeneous fleets of trucks and shovels operating in twelve-hour shifts. Table 3 provides additional details on the simulated scenarios. The actual data, including truck and shovel velocities and their respective capacities, were used to define the agents’ properties. Table 4 presents the property sets of the agents. The modeled transport infrastructure consists of 638 nodes and 1330 edges, reflecting the layout and connectivity of the mine’s operational network.
The simulation of three scenarios allows the analysis of the scalability of the proposed system. In particular, the largest scenario—100 trucks and 25 shovels—closely reflects the operational scale of a real open-pit mine in northern Chile. While actual mining operations involve other types of equipment, such as drills, water trucks, and personnel carriers, these were not included in the simulation since the focus of this study is on the coordination and scheduling of the haulage and loading fleet.
To assess system performance, several evaluation criteria were analyzed. Total material transported (tons) was measured to determine the overall productivity of the system. Cost (hours) represents the travel time required for the trucks to complete their assigned tasks. Efficiency (tons per hour) was calculated to evaluate the effectiveness of material transportation over time. Additionally, the idle time of trucks and shovels (hours) was examined to assess resource utilization and identify potential bottlenecks in scheduling.
Computational efficiency was also a factor in the evaluation. The computation time required to generate schedules (minutes) was recorded to determine the feasibility of real-time implementation. In addition, negotiation dynamics were analyzed through metrics such as the total number of negotiations and refuse messages, which provided insight into the decision-making process of agents. Finally, the average negotiation time (milliseconds) was monitored to evaluate the responsiveness of the system and its ability to adapt to dynamic conditions.
In addition to system performance and negotiation efficiency, evaluation criteria were also defined to assess the learning process of the truckAgents. The effectiveness of the reinforcement learning mechanism was examined by tracking the evolution of decision-making strategies over multiple runs. Specifically, the number of refuse messages over time was analyzed as an indicator of improved decision making. The convergence of Q-values in the learning model was monitored to determine whether the truckAgents were successfully optimizing their dispatching policies. Additionally, the impact of learning on efficiency and cost was evaluated, comparing performance with and without learning to quantify the benefits of adaptive decision making. These metrics provide insight into how well the agents adapted to dynamic conditions and whether the reinforcement learning approach contributed to more efficient truck dispatching.
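The quantitative metrics above reduce to simple ratios and differences; the short sketch below spells out that arithmetic with hypothetical method names (the exact definition used for idle time is an assumption).

```java
// Illustrative arithmetic for the evaluation metrics of Section 3.3 (hypothetical names).
class ScheduleMetrics {
    /** Efficiency (tons per hour) = total material transported / total travel-time cost. */
    static double efficiencyTonsPerHour(double totalTons, double costHours) {
        return totalTons / costHours;
    }

    /** Assumed definition: idle time of a machine = shift length minus time spent working. */
    static double idleHours(double shiftHours, double workingHours) {
        return shiftHours - workingHours;
    }
}
```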

4. Results and Discussion

This section presents the results from three different perspectives: production and scheduling efficiency, negotiation process and computation time, and the learning process. The first part analyzes how reinforcement learning (RL) affects material transportation, efficiency, and resource utilization. The second part focuses on how RL impacts the negotiation process among agents and the required computational time to generate schedules. Finally, the third part provides insights into the learning dynamics of the system and its effects on decision making.

4.1. Analysis from the Perspective of Production and Scheduling

The experiments evaluated the performance of the multi-agent system applied to schedule generation for truck and shovel operations in an open pit. Scenarios with and without the application of reinforcement learning were compared to assess its impact on operational efficiency. Figure 3 shows the material transported with and without RL in the three scenarios.
In terms of production, the total amount of material transported did not vary significantly between scenarios with and without RL. A marginal advantage of 0.94% in transported material was observed in the non-RL system, but only in the scenario with 100 trucks and 25 shovels.
Another relevant aspect is the idle time of trucks and shovels. An increase in truck waiting time was observed in scenarios in which RL was applied, especially in scenario 3 (126.82 h compared to 86.05 h). This suggests that reinforcement learning may lead to a higher number of decisions resulting in idle trucks, which could negatively impact productivity. On the other hand, the idle time of shovels remained relatively stable or even slightly decreased in scenarios in which RL was applied, indicating that the allocation of trucks to shovels was more effective.
In terms of operational costs in working hours, the use of RL reduced total operating time in all scenarios. In the case of 100 trucks and 25 shovels, the non-RL system required 481.78 h, while with RL, it was reduced to 459.55 h, representing an improvement in operational efficiency. Figure 4 shows the cost to transport material in each scenario with and without RL.
From an operational efficiency standpoint, reinforcement learning proved beneficial in optimizing total operational time. The efficiency metric—defined as tons transported per hour—improved across all scenarios in which RL was applied. In particular, the large-scale configuration with 100 trucks and 25 shovels increased efficiency by 3.84% (from 361.67 to 375.59 tons per hour), reflecting better utilization of equipment. While RL increased the idle time of the trucks in these scenarios, the total operational time was reduced, indicating a more selective and effective task assignment strategy. The system prioritized higher-quality negotiations, which reduced failed assignments but allowed some trucks to remain idle. This trade-off does not reflect a loss of productivity, but rather a shift in decision-making behavior toward better coordination.

4.2. Analysis of the Negotiation Process and Computation Time

From the perspective of the negotiation process among agents, the results show that the number of negotiations carried out was not significantly affected by the use of RL. However, the number of refuse messages sent by truckAgents in response to call-for-proposals messages sent by shovelAgents increased significantly. In the scenario with 100 trucks and 25 shovels, the number of refuse messages increased from 20,385 to 23,439 when RL was applied, suggesting that truckAgents became more selective in sending proposals. This behavior is consistent with the learning strategy, in which truckAgents try to optimize their performance based on past experiences.
The average time per negotiation remained relatively stable across all scenarios, with a marginal increase when using RL. This suggests that the negotiation mechanism based on the Contract Net Protocol remains efficient even when agents incorporate learning.
Regarding the computation time required to generate schedules, the results show that RL had only a minor impact. For 15 trucks and 5 shovels, the computation time decreased by 2.73% with RL (28.59 min vs. 27.81 min); for 50 trucks and 10 shovels, it decreased by 2.28% (28.13 min vs. 27.49 min); and for 100 trucks and 25 shovels, it increased by 1.66% (28.39 min vs. 28.86 min).

4.3. Analysis of the Learning Process

Figure 5 shows the performance of Q-learning, considering different phases of the learning process for the largest scenario (100 trucks and 25 shovels). In the initial phase, the truckAgents respond with proposals to all call-for-proposal messages from shovelAgents and receive many rejections, gradually accumulating negative penalties. During the progressive learning phase, the truckAgents learn to avoid sending proposals to shovelAgents that have repeatedly rejected them, resulting in fewer rejections and a gradual increase in acceptances. In the forgetting phase, after a certain number of episodes (in this case, every 20 episodes), the truckAgents “forget” some of the shovelAgents that previously rejected them, allowing them to try again and creating new opportunities for acceptance. As a result, the accumulated reward grows in a stepwise manner, reflecting a mix of learning and forgetting. Periods of stability can be observed after successful learning, followed by slight declines when the truckAgents reintroduce attempts with previously forgotten shovels. This model represents a realistic strategy in which the truckAgents can dynamically adjust their behavior, avoiding permanent decision-making deadlocks.
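The paper does not detail how the forgetting phase is implemented. One plausible realization, consistent with the description above (a reset every 20 episodes for shovels the agent has learned to avoid), is sketched below; the class, the reset rule, and all names are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of the periodic "forgetting" described above: every
// FORGET_INTERVAL episodes the truckAgent clears part of its learned state so
// that previously rejecting shovels can be tried again. The actual mechanism
// used in MAS-TDRL is not detailed in the paper.
class ForgettingSchedule {
    static final int FORGET_INTERVAL = 20;   // episodes between forgetting steps

    private final Map<String, double[]> qTable;          // shovel id -> [Q(refuse), Q(propose)]
    private final Map<String, Integer> rejectionCounts;  // shovel id -> rejections so far
    private int episode = 0;

    ForgettingSchedule(Map<String, double[]> qTable, Map<String, Integer> rejectionCounts) {
        this.qTable = qTable;
        this.rejectionCounts = rejectionCounts;
    }

    /** Called once per episode; periodically resets shovels the agent has learned to avoid. */
    void onEpisodeEnd() {
        episode++;
        if (episode % FORGET_INTERVAL != 0) return;
        for (Map.Entry<String, double[]> e : qTable.entrySet()) {
            double[] q = e.getValue();
            if (q[0] > q[1]) {               // currently prefers "refuse" for this shovel
                q[0] = 0.0;
                q[1] = 0.0;                  // forget: give the shovel another chance
                rejectionCounts.put(e.getKey(), 0);
            }
        }
    }
}
```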
By incorporating RL, truckAgents gradually refine their bidding strategies based on past negotiations. As a result, they become more selective when deciding whether to send a propose or refuse message. This is reflected in the increased number of refuse messages in scenarios in which RL was applied.
One effect of RL is that truckAgents send more refuse messages instead of proposals. As a result, shovelAgents receive fewer proposals to evaluate, reducing their decision-making time. With fewer proposals to evaluate, shovelAgents can request confirmation more quickly, speeding up the negotiation process.
In addition, when a truckAgent sends a refuse message in a negotiation where it could have sent a propose, it participates in fewer concurrent negotiations. This means that it has a higher chance of confirming its participation with a shovelAgent, since it is not waiting for a potentially better offer in another negotiation. Without RL, a truckAgent could be in multiple negotiations simultaneously, hesitating to confirm one while waiting to win another. If it lost that other negotiation, it could result in the agent failing to secure any successful assignments. Thus, RL enables a more focused and committed negotiation strategy, potentially increasing the overall success rate of truck assignments.

5. Conclusions and Future Work

In this work, MAS-TDRL, a multi-agent system for truck dispatching with learning capabilities, was developed to generate schedules for trucks and shovels in the mining industry. The MAS-TDRL allows trucks and shovels to interact through a negotiation protocol based on the Contract Net Protocol, optimizing task allocation and reducing operational costs. The performance of the MAS-TDRL with RL was evaluated in comparison to a MAS without RL.
The main result demonstrated that the incorporation of reinforcement learning improved the efficiency of the generated schedules compared to a multi-agent system without RL. Thanks to their learning capability, the truckAgents were able to make better decisions in selecting which negotiations to participate in, avoiding failed negotiations and increasing their chances of reaching a beneficial agreement. Despite the addition of a learning phase, the schedule generation time did not increase significantly, because the acquired knowledge allowed certain calculations to be skipped during negotiation, streamlining the process without affecting performance.
For future work, we propose two main directions. From the learning perspective, we aim to compare our current implementation with other techniques such as Deep Q-Networks (DQN), explore more expressive state representations—including variables such as truck location and queue lengths—and evaluate alternative reward functions that integrate operational efficiency metrics. From the multi-agent system and negotiation perspective, we plan to improve the coordination mechanism and extend the simulation scenarios to include auxiliary equipment such as drills, water trucks, and personnel carriers, as well as the evolving dynamics of the mine. These extensions will help capture the complexity of real-world operations more accurately.

Author Contributions

Conceptualization, G.I.-A. and O.H.; methodology, G.I.-A.; software, G.I.-A.; validation, G.I.-A. and O.H.; formal analysis, G.I.-A.; investigation, G.I.-A.; resources, G.I.-A.; data curation, G.I.-A.; writing—original draft preparation, G.I.-A. and O.H.; writing—review and editing, O.H.; supervision, O.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MAS | Multi-Agent System
MAS-TD | Multi-Agent System for Truck Dispatching
RL | Reinforcement Learning
MAS-TDRL | Multi-Agent System for Truck Dispatching with Reinforcement Learning
CNP | Contract Net Protocol

References

  1. Alarie, S.; Gamache, M. Overview of Solution Strategies Used in Truck Dispatching Systems for Open Pit Mines. Int. J. Surf. Min. Reclam. Environ. 2002, 16, 59–76. [Google Scholar] [CrossRef]
  2. Icarte, G.; Rivero, E.; Herzog, O. An Agent-based System for Truck Dispatching in Open-pit Mines. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence—Volume 1: ICAART; Rocha, A., Steels, L., van den Herik, J., Eds.; SciTePress: Valletta, Malta, 2020; pp. 73–81. [Google Scholar]
  3. Adams, K.K.; Bansah, K.K. Review of Operational Delays in Shovel-Truck System of Surface Mining Operations. In Proceedings of the 4th UMaT Biennial International Mining and Mineral Conference, Tarkwa, Ghana, 3–6 August 2016; pp. 60–65. [Google Scholar]
  4. Icarte, G.; Herzog, O. A multi-agent system for truck dispatching in an open-pit mine. In Abstracts of the Second International Conference Mines of the Future 13 & 14 June 2019, Institute of Mineral Resources Engineering, RWTH Aachen University; Lottermoser, B., Ed.; Verlag Mainz: Aachen, Germany, 2019. [Google Scholar]
  5. Patterson, S.R.; Kozan, E.; Hyland, P. Energy efficient scheduling of open-pit coal mine trucks. Eur. J. Oper. Res. 2017, 262, 759–770. [Google Scholar] [CrossRef]
  6. Newman, A.M.; Rubio, E.; Caro, R.; Weintraub, A.; Eurek, K. A review of operations research in mine planning. Interfaces 2010, 40, 222–245. [Google Scholar] [CrossRef]
  7. Chang, Y.; Ren, H.; Wang, S. Modelling and optimizing an open-pit truck scheduling problem. Discret. Dyn. Nat. Soc. 2015, 2015, 745378. [Google Scholar] [CrossRef]
  8. Da Costa, F.P.; Souza, M.J.F.; Pinto, L.R. Um modelo de programação matemática para alocação estática de caminhões visando ao atendimento de metas de produção e qualidade. Rem Rev. Esc. De Minas 2005, 58, 77–81. [Google Scholar] [CrossRef]
  9. Krzyzanowska, J. The impact of mixed fleet hauling on mining operations at Venetia mine. J. South. Afr. Inst. Min. Metall. 2007, 107, 215–224. [Google Scholar]
  10. Zhang, X.; Guo, A.; Ai, Y.; Tian, B.; Chen, L. Real-Time Scheduling of Autonomous Mining Trucks via Flow Allocation-Accelerated Tabu Search. IEEE Trans. Intell. Veh. 2022, 7, 466–479. [Google Scholar] [CrossRef]
  11. Huo, D.; Sari, Y.A.; Kealey, R.; Zhang, Q. Reinforcement Learning-Based Fleet Dispatching for Greenhouse Gas Emission Reduction in Open-Pit Mining Operations. Resour. Conserv. Recycl. 2023, 188, 106664. [Google Scholar] [CrossRef]
  12. Yaakoubi, Y.; Dimitrakopoulos, R. Learning to schedule heuristics for the simultaneous stochastic optimization of mining complexes. Comput. Oper. Res. 2023, 159, 106349. [Google Scholar] [CrossRef]
  13. Zhang, C.; Odonkor, P.; Zheng, S.; Khorasgani, H.; Serita, S.; Gupta, C. Dynamic Dispatching for Large-Scale Heterogeneous Fleet via Multi-agent Deep Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 1436–1441. [Google Scholar] [CrossRef]
  14. Noriega, R.; Pourrahimian, Y. Shovel Allocation and Short-Term Production Planning in Open-Pit Mines Using Deep Reinforcement Learning. Int. J. Min. Reclam. Environ. 2024, 38, 442–459. [Google Scholar] [CrossRef]
  15. de Carvalho, J.P.; Dimitrakopoulos, R. Integrating short-term stochastic production planning updating with mining fleet management in industrial mining complexes: An actor-critic reinforcement learning approach. Appl. Intell. 2023, 53, 23179–23202. [Google Scholar] [CrossRef]
  16. Feng, Z.; Liu, G.; Wang, L.; Gu, Q.; Chen, L. Research on the Multiobjective and Efficient Ore-Blending Scheduling of Open-Pit Mines Based on Multiagent Deep Reinforcement Learning. Sustainability 2023, 15, 5279. [Google Scholar] [CrossRef]
  17. Smith, R.G. The Contract Net Protocol: High-level communication and control in a distributed problem solver. IEEE Trans. Comput. 1980, C-29, 1104–1113. [Google Scholar] [CrossRef]
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  19. Bellifemine, F.; Caire, G.; Greenwood, D. Developing Multi-Agent Systems with JADE; John Wiley & Sons: Chichester, UK, 2007. [Google Scholar] [CrossRef]
Figure 1. The truck cycle [2].
Figure 2. Enhanced Contract Net Protocol [2].
Figure 3. Comparison of transported material.
Figure 4. Travel time to transport the material.
Figure 5. Performance of Q-learning.
Table 1. Formulation notation.

Set | Index | Description
S | s | Shovels
R | r | Trucks
L_s | l | Time slots of shovel s
J | j | Destinations

Parameters
C_s | Loading time of shovel s
C_j | Unloading time at destination j
C_js | Travel time from shovel s to destination j
C_sr | Travel time of truck r to shovel s (only at the beginning of the shift)
C_s'j | Travel time from destination j to the next shovel s'
A_r | Truck capacity
δ_s | Target of extracted material for shovel s
M | Sufficiently large positive number

Decision Variables
X_r,s,l | 1 if truck r loads at shovel s in time slot l; 0 otherwise
λ_s,l | Loading start time of shovel s in time slot l
μ_s,l | Unloading start time of the material extracted by shovel s in time slot l
λ^seq_r,s,l,s',l' | 1 if truck r was loaded by shovel s in time slot l before being loaded by shovel s' in time slot l'; 0 otherwise
Table 2. Example of a schedule created for a truck.

Assignment | Destination | Start Time of the Trip | Arrival Time | Start Time of Loading/Unloading | End Time of the Assignment
0 | Shovel.10 | 05:57:01 | 06:10:23 | 06:10:36 | 06:15:12
1 | WasteDump.07 | 06:15:12 | 06:32:33 | 06:38:23 | 06:40:23
2 | Shovel.01 | 06:45:25 | 06:58:47 | 07:00:10 | 07:05:35
3 | WasteDump.05 | 07:05:35 | 07:22:24 | 07:26:38 | 07:27:12
4 | Shovel.03 | 07:37:44 | 07:41:25 | 07:43:18 | 07:48:32
Table 3. Simulated scenarios.

Scenario ID | Number of Trucks | Number of Shovels
1 | 15 | 5
2 | 50 | 10
3 | 100 | 25
Table 4. Property values for the simulations.

Equipment | Property | Unit | Min Value | Max Value
Trucks | Velocity loaded | [km/h] | 20 | 25
Trucks | Velocity empty | [km/h] | 40 | 55
Trucks | Capacity | [tons] | 300 | 370
Trucks | Spotting time | [sec] | 20 | 80
Trucks | Current load | [tons] | 0 | 370
Shovel | Capacity | [tons] | 35 | 80
Shovel | Load time | [sec] | 8 | 30
Shovel | Dig time | [sec] | 8 | 20
Destination | Location at mine (crusher, stockpile, or waste dump) | | |
Crusher | Equipment discharging | [number of trucks] | 1 | 1
Stockpile | Equipment discharging | [number of trucks] | 1 | 20
Waste Dump | Equipment discharging | [number of trucks] | 1 | 20