*Article* **Diagnosis of Problems in Truck Ore Transport Operations in Underground Mines Using Various Machine Learning Models and Data Collected by Internet of Things Systems**

**Sebeom Park <sup>1</sup>, Dahee Jung <sup>1</sup>, Hoang Nguyen <sup>2,3</sup> and Yosoon Choi <sup>1,</sup>\***


**Abstract:** This study proposes a method for diagnosing problems in truck ore transport operations in underground mines using four machine learning models (i.e., Gaussian naïve Bayes (GNB), k-nearest neighbor (kNN), support vector machine (SVM), and classification and regression tree (CART)) and data collected by an Internet of Things system. A limestone underground mine with an applied mine production management system (using tablet computers and Bluetooth beacons) is selected as the research area, and log data related to the truck travel time are collected. The machine learning models are trained and verified using the collected data, and grid search through 5-fold cross-validation is performed to improve the prediction accuracy of the models. The training accuracy of CART is highest (94.1%) when the parameters min\_samples\_leaf and min\_samples\_split are set to 1 and 4, respectively. In the validation of the machine learning models performed using the validation dataset (1500 records), the accuracy of CART was 94.6%, and the precision and recall were 93.5% and 95.7%, respectively. In addition, the F1 score reaches a value as high as 94.6%. Through field application and analysis, it is confirmed that the proposed CART model can be utilized as a tool for monitoring and diagnosing the status of truck ore transport operations.

**Keywords:** bluetooth beacon; classification and regression tree; gaussian naïve bayes; k-nearest neighbors; support vector machine; transport route; transport time; underground mine

## **1. Introduction**

Because the productivity and profits of mines can vary greatly depending on the design and planning of the production process, optimal operation methods and equipment utilization strategies are needed to maximize productivity and equipment efficiency and minimize operating costs [1–5]. The cost of transporting ore and waste accounts for over 50% of the total mine operational cost; therefore, it is crucial to design and operate the transport system efficiently [6]. Methods to improve the productivity and efficiency of the mine transport system are divided broadly into two types: methods to properly establish an operational plan so that the mine can be operated effectively, and methods to monitor and manage the site to verify whether the established plan is being implemented well.

Recently, various mathematical decision-making models and deterministic and probabilistic simulation models have been proposed by researchers to establish an operational plan, such as optimizing the operational method and equipment allocation plan of the mine transport system and minimizing material handling costs [4,7–13]. Since the first implementation of discrete event simulation by Rist to solve problems related to ore transport in mines, many researchers have conducted research on discrete event simulation [14]. Salama and Greberg [15] performed a simulation of a load-haul-dump (LHD) machine and a truck to optimize the number of trucks used in haulage operations in an underground mine. Choi [16] developed a discrete event simulation program to simulate the shovel-truck transport system of an open-pit mine using the GPSS/H simulation language. Choi and Nieto [3] extended this to analyze the optimal transport path of a truck. Subsequently, they performed discrete event simulations of transport equipment and provided a function to visualize the simulation results. Park and Choi [17–22] developed GPSS/H-based programs and user-friendly programs to simulate truck-loader transport systems, considering various conditions such as fixed/real-time allocation, crusher capacity, and the possibility of truck failure.

If the operational plan of the transport system of the mine has been properly established, it is also crucial to continuously monitor the operational status of the transport system and to verify whether the established plan is properly implemented at the site. Until now, research on monitoring and diagnosing the operating status of transport systems or equipment has been conducted by various researchers. Thompson et al. [23] provided the basis for mine maintenance management systems (MMS) by integrating data collected through onboard multi-sensors that were installed on trucks with existing mine communication and asset management systems. Park and Choi [24] developed a system that could collect truck travel time data using Bluetooth beacons and tablet computers. In addition, a method for analyzing and diagnosing the transport route status of underground mines was proposed using the collected data. Wodecki et al. [25] proposed a monitoring system that could identify major possible causes of machine failure events using the operational parameters of LHD in mines. Carvalho et al. [26] developed a system that could automatically identify the failure of a roller, one of the important components of a belt conveyor, by combining a thermal imaging camera with an unmanned aerial vehicle (UAV).

Recently, machine learning techniques have been actively utilized to monitor the transport systems and assets of mines, diagnose failures, and perform proper maintenance. Paduraru and Dimitrakopoulos [27] utilized neural networks and policy gradient reinforcement learning in data-driven decision-making processes to optimize material flow in large mining complexes. Ristovski et al. [28] used machine learning to predict the probability distributions of the durations of equipment activities used in mining operations. Xue et al. [29] and Sun et al. [30] used machine learning models to predict truck travel time. Zhang et al. [31] used the support vector machine (SVM), a machine learning technique, to diagnose and classify defects of scraper conveyors in coal mines. D'Angelo et al. [32] proposed a method for real-time diagnosis of defects in rollers of belt conveyors using an object detection model based on a deep learning architecture.

Establishing operational plans, such as mine design, production forecasting, and equipment allocation, is important to ensure the productivity and efficiency of mines. In addition, identifying in advance the sections in which the truck travel time is expected to be abnormal is crucial because this makes it possible to prevent problems from occurring in those sections and vehicles in the future. However, no research case has been reported thus far on monitoring and diagnosing the condition of a mine transport system using machine learning techniques. Therefore, we propose a method to evaluate the stability of transport routes and to diagnose the operational status by combining a mine production management system using tablet computers and Bluetooth beacons with machine learning techniques. To this end, a limestone underground mine in Korea (to which tablet computers and Bluetooth beacons were applied) was selected as the research area, and log data related to truck travel time were collected for a certain period. In addition, machine learning models were trained using the collected data. Thereafter, the stability of each section of the transport route in the study area was evaluated, diagnosed, and analyzed using the trained models.

#### **2. Study Area and Data Collection**


In this study, an underground mine (37°17′12″ N, 128°43′53″ E) owned by Seongshin Minefield in Korea was selected as the research area. Figure 1 depicts an aerial view of the study area and an underground tunnel. The mine uses the room and pillar mining method to produce 1 million tons of high-quality limestone annually. Drilling is performed with the V-cut method using jumbo drills and crawler drills. The mine then produces approximately 4500 tons of limestone, with an average of 8–9 blasts per day, using ammonium nitrate fuel oil (ANFO), emulite, and electric detonators (6 ms). The mined limestone is loaded into 25–40 ton dump trucks with loaders (3.0–5.6 m<sup>3</sup>) and transported to the crusher located outside the mine. The study area operates eight loading areas and three unloading points, and three loaders and ten trucks are used to produce limestone.

**Figure 1.** Map of the study area (Sungshin Minefield underground limestone mine, Jeongsun-gun, Gangwon-do, Korea) showing the loading areas and dumping areas.

The underground mine selected as the study area is equipped with a tablet computer and Bluetooth beacon-based mine production management system. This system provides functions for navigation, equipment proximity warning, production log creation, and measurement of truck travel time for each section of the underground mine [33]. The operation of the system is performed in the following order: (1) Signals are received from Bluetooth beacons installed at major points along the transport route, crusher, and loaders using a tablet computer mounted on the truck. (2) The tablet computer records the time the signal was received and the location of the truck, and (3) transmits the data stored in the internal memory to the cloud server in an area where wireless communication is possible. (4) Finally, the cloud server continuously stores and manages data transmitted from multiple trucks with tablet computers installed. For details on the operation of the system, please refer to Park and Choi [33]. Tablet computers were installed in 10 trucks used for transport operations. Bluetooth beacons were installed at loading and unloading points (8 and 3, respectively) and at major points along the transport route (11). Figure 2 shows an example of a tablet computer and Bluetooth beacon installed in the study area. Figure 3 depicts a schematic diagram showing the locations of the loading and unloading points and the Bluetooth beacons installed on the main transport route.

**Figure 2.** Example of Bluetooth beacon (Beacon i3) and Tablet PC (Galaxy A 8.0) installed for log data collection: (**a**) tunnel wall on the transport route; (**b**) near the crusher; (**c**) windshield in the driver's seat of the truck.

**Figure 3.** Transport routes between loading and unloading points and Bluetooth beacon installation points in the study area: (**a**) 2D maps; (**b**) schematic.

The purpose of this study is to calculate the truck travel time for each section based on the main points where Bluetooth beacons are installed, evaluate the stability of each transport route using a machine learning model, and diagnose the status of the transport route. The system developed by Park and Choi [33] uses a tablet computer to record the time a truck passes the point where a Bluetooth beacon is installed; however, it cannot record the travel time of a truck traveling between two beacons. Therefore, in this study, the truck travel time for each section was calculated using the log data analysis program developed by Park and Choi [24]. The program retrieves all the log data files uploaded to the cloud server at once, organizes the log data, and calculates the truck travel time for each section.
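Conceptually, the per-section travel time is the difference between the timestamps of two consecutive beacon detections for the same truck. The following minimal sketch illustrates this computation; it is not the authors' program, and the record layout and field names are hypothetical.

```python
# Illustrative sketch (not the authors' program): deriving per-section truck
# travel times from beacon log records. The record layout is hypothetical.
# Each record: (truck_id, beacon_id, timestamp in seconds).
logs = [
    ("truck_B", 11, 1000.0),
    ("truck_B", 6, 1092.5),
    ("truck_B", 5, 1180.0),
]

def section_travel_times(records):
    """Sort one truck's records by time and compute the elapsed time
    between consecutive beacon detections (one row per section)."""
    records = sorted(records, key=lambda r: r[2])
    rows = []
    for (tid, b_from, t_from), (_, b_to, t_to) in zip(records, records[1:]):
        rows.append({"truck": tid, "origin": b_from,
                     "destination": b_to, "travel_time_s": t_to - t_from})
    return rows

for row in section_travel_times(logs):
    print(row)
```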

In this study, log data collected from 9 November 2020 to 21 February 2021 (15 weeks) were used to evaluate and diagnose the stability of each section of the transport route using machine learning techniques. During this period, 361 log data files were uploaded to the cloud server, and 33,435 truck travel time data by section were collected.

#### **3. Methods**

The purpose of this study is to evaluate and diagnose the stability of the transport route by using the truck travel time for each section of the transport route and machine learning techniques. To achieve this purpose, the research was conducted in the following order: data collection for training and validation, data preprocessing, machine learning model selection, and model application.

#### *3.1. Data Preprocessing for Machine Learning Model*

Factors for diagnosing the status of each section of the transport route include physical factors (location and slope of section, presence or absence of surrounding workplaces, width of transport routes, whether or not ores are loaded, etc.) and environmental factors (weather, presence or absence of groundwater, etc.) [24].

Therefore, the training data of the machine learning model for diagnosing the status of the transport route comprised six input features and a label indicating the status of the transport route, as shown in Table 1. The data types can be divided into categorical data and continuous data. The categorical data include the origin and destination (consisting of beacon IDs) of the transport route section and whether ores are loaded. The continuous data include truck travel time, average daily temperature, and daily precipitation.


**Table 1.** Description and data type of the dataset for training the machine learning model.

Coding of the raw data to train the machine learning model was performed using log data related to truck travel time, collected from the mine production management system, together with weather data provided by the Korea Meteorological Administration. The status of the transport route was determined through the mine production management system installed in the research area. The truck driver judges whether the operation of the truck was normal or abnormal by considering whether any irregularity occurred, such as natural causes, vehicle maintenance, tunnel closure, work interruption, accident, or excessive waiting. The truck drivers use the application of the mine production management system to input whether the operation was performed normally or abnormally each time one cycle of loading, transporting, and unloading is completed. In this study, normal operation was coded as 0, and abnormal operation was coded as 1. Of the 33,435 truck travel time data collected by section, 3314 were classified as abnormal (1) by the truck drivers.

The data types of the input features used in this study consist of categorical data and continuous data. Because the features are measured on different scales, features with small absolute values may be ignored by the fault diagnosis system unless the data are normalized. Therefore, the data were normalized using the min-max scaling-normalization method (Equation (1)):

$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$

This method can effectively prevent overfitting when training a machine learning model [34] and, through data preprocessing, removes absolute differences between data items while maintaining the relative differences within each item. It can also improve the effectiveness of classification because it reduces the number of parameter adjustment steps and improves the training speed of the model.

The dataset for training the machine learning model was balanced so that the ratio of data classified as normal to data classified as abnormal was 1:1, yielding a set of 6000 records (normal: 3000, abnormal: 3000). To validate the model trained with the training dataset, the entire dataset was divided into a training dataset and a validation dataset, set to 75% and 25% of the total, respectively (i.e., training dataset: 4500, validation dataset: 1500).
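As a concrete illustration, the preprocessing steps above can be reproduced with scikit-learn and pandas (the parameter names used later in the paper suggest scikit-learn, although the library is not named explicitly). This is a minimal sketch with synthetic stand-in data and hypothetical column names; the categorical features (beacon IDs and loading status) are omitted for brevity.

```python
# A minimal sketch of the preprocessing described above, assuming
# scikit-learn and pandas; column names are hypothetical and the data
# here are synthetic stand-ins for the collected log data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "travel_time_s": rng.uniform(1, 299, 33435),
    "avg_temp_c": rng.uniform(-14.3, 11.4, 33435),
    "precip_mm": rng.uniform(0, 6.9, 33435),
    "label": (rng.random(33435) < 0.1).astype(int),  # 1 = abnormal
})

# Balance normal and abnormal records 1:1 (3000 of each), as in the study.
abnormal = df[df["label"] == 1].sample(3000, random_state=0)
normal = df[df["label"] == 0].sample(3000, random_state=0)
data = pd.concat([normal, abnormal])

X = data[["travel_time_s", "avg_temp_c", "precip_mm"]].to_numpy()
y = data["label"].to_numpy()

# 75/25 train/validation split, then min-max scaling per Equation (1).
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = MinMaxScaler().fit(X_tr)  # learns min(x) and max(x) per feature
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)
```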

#### *3.2. Experimental Setup for Machine Learning Algorithms*

In this study, the stability and status of each section in the underground mine were evaluated and diagnosed using machine learning algorithms. For this, Gaussian naïve Bayes (GNB), k-nearest neighbor (kNN), support vector machine (SVM), and classification and regression tree (CART) models were used.

Naïve Bayes (NB) is a set of supervised learning algorithms that apply Bayes' theorem with the "naive" assumption of independence between every pair of features [35]. NB classifiers can be trained efficiently in a supervised learning setting, with parameters estimated by the method of maximum likelihood; in many applications, they have been confirmed to work well without the user adopting Bayesian probability or Bayesian methods. In addition, only a small amount of training data is required to estimate the parameters needed for classification. NB can be divided mainly into Gaussian naïve Bayes (GNB) and multinomial naïve Bayes according to the type of data (i.e., continuous or categorical). GNB is an algorithm that handles the continuous values associated with each class by assuming that they follow a Gaussian distribution. For example, after dividing the training data containing the continuous attribute *x* according to class, the mean and variance of *x* in each class *c* are denoted µ<sub>c</sub> and σ<sub>c</sub><sup>2</sup>, respectively. Then, for an observed value *v*, the probability distribution of *v* given class *c* can be parameterized by µ<sub>c</sub> and σ<sub>c</sub> and calculated through the normal distribution equation (Equation (2)):

$$p(x = v \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}}\, e^{-\frac{(v - \mu_c)^2}{2\sigma_c^2}} \tag{2}$$
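For reference, a minimal sketch of a GNB classifier with scikit-learn's `GaussianNB` is shown below, assuming the `X_tr`, `y_tr`, `X_va`, and `y_va` arrays from the preprocessing sketch above; its `var_smoothing` parameter is the one tuned later in Section 3.2.

```python
# Hedged sketch: Gaussian naive Bayes with scikit-learn. The var_smoothing
# value shown is the optimum reported later in Section 4.2; it adds a
# fraction of the largest feature variance to all variances for stability.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB(var_smoothing=1.88e-4)
gnb.fit(X_tr, y_tr)
print("validation accuracy:", gnb.score(X_va, y_va))
```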

The kNN model is one of the most intuitive and simple supervised learning models among machine learning models. The kNN does not learn in advance, but rather defers this step and then performs classification when a task request for new data is received. Therefore, it is also variously called instance-based learning, memory-based learning, or lazy learning. The idea in the kNN method is to assign new unclassified examples to the class to which the majority of its *k* nearest neighbors belong. It is effective to reduce the error of misclassification when the number of samples in the training dataset is large;

however, the classification accuracy depends on the value of *k*, the number of neighbors, and depends greatly on the distance metric used to identify the *k* nearest neighbors [36]. In simple kNN, the search is based on the number of class data classified closest to the new data. Figure 4 shows the classification of data according to different *k* values. As shown in Figure 4a, a virtual circle (in the two-dimensional case) centered on the new data point is expanded until the first data point is found, and the group to which that point belongs becomes the group of the new data point (*k* = 1). Similarly, the virtual circle is extended until three data points (*k* = 3) are found, and the largest group among the three points determines the group of the new data point (Figure 4b).


**Figure 4.** Example result of kNN model according to *k* value: (**a**) *k* = 1; (**b**) *k* = 3.
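A minimal sketch with scikit-learn's `KNeighborsClassifier` follows (again assuming the arrays from the preprocessing sketch); `n_neighbors` corresponds to *k*, and the distance metric defaults to Euclidean.

```python
# Sketch of kNN classification; k = 1 is the best value reported in
# Section 4.2. As a "lazy" learner, fit() only stores the training data.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_tr, y_tr)
print("validation accuracy:", knn.score(X_va, y_va))
```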

SVM was introduced by Boser et al. [37] in 1992 and has been popular in the learning community since 1996. Recently, it has been successfully applied to various problems related to pattern recognition in bioinformatics and image recognition [38]. In addition, it is sufficiently powerful to be used for both linear and non-linear regression and classification and is widely used. SVM is basically a model that classifies data linearly, like linear logistic regression, and classifies data in three stages, as shown in Figure 5. Assuming that there are two-dimensional data composed of two classes, as shown in Figure 5a, there can be an infinite number of straight lines separating these classes; however, using the decision boundary selection condition of SVM, only one straight line is selected. The selection condition is to select the hyperplane that maximizes the distance between the data points of each class that are closest to each other. First, as shown in Figure 5b, the closest points between the classes are selected, and the two parallel straight lines passing through these points whose mutual margin is maximized are chosen. The points used to select the two straight lines are called support vectors, and when these two straight lines are determined, the central straight line located at the same distance between them becomes the decision boundary, as shown in Figure 5c.

The optimal hyperplane can be defined as the following equation [39]:

$$y_i(\omega \cdot x_i + b) \ge 1 \quad \text{for } 1 \le i \le n, \ \omega \in \mathbb{R}^d, \ b \in \mathbb{R} \tag{3}$$

where *x<sup>i</sup>* is an instance with its corresponding label *y<sup>i</sup>* ∈ {−1, 1}, *b* is an intercept term, ω is a normal vector to the hyperplane, *d* is the number of properties of each instance (the dimension of the input vector), and *n* is the number of instances. A hyperplane is defined by the instances that lie nearest to it; such instances are called support vectors. By this definition, there should be no data points between the hyperplanes containing the support vectors (hard margin classification); however, this rarely holds in the real world, because real data often contain outliers that differ significantly from other instances of the same class, in addition to possible data entry errors, measurement errors, etc. Therefore, we used the definition (soft margin classification, Equation (4)) proposed by Tuba et al. [39] for the optimal hyperplane, which overcomes this problem by relaxing the conditions so that SVM can be used for real data classification:

$$y_i(\omega \cdot x_i + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0, \quad 1 \le i \le n \tag{4}$$

here, ε*<sup>i</sup>* is a slack variable that allows the corresponding instance to violate the margin. To find the optimal hyperplane, the following quadratic programming problem must be solved:

$$\min \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} \varepsilon_i \tag{5}$$


**Figure 5.** An example of a two-dimensional representation of a linearly separable binary classification: (**a**) two-dimensional data consisting of two classes; (**b**) selecting the closest points between the classes and the two parallel straight lines through these points whose mutual distance is maximum; (**c**) selecting a hyperplane that is equidistant from the two straight lines.

Here, C represents the parameter of the soft margin cost function, and the quality of the SVM model largely depends on the choice of this parameter; that is, the larger the value of C, the more closely the generated model resembles one obtained from the hard margin classification definition. However, because soft margin classification still yields only a linear decision boundary, a kernel function is used rather than a plain dot product when the data are not linearly separable. The kernel function maps the instances into a higher-dimensional space to ensure that they can be linearly separated. There are various kernel functions, such as polynomial, Gaussian (radial basis function, RBF), and sigmoid functions, but the RBF kernel is the most commonly used and can be defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) \tag{6}$$

where *γ* is a free parameter that defines the influence of each training instance and significantly affects classification accuracy.
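A minimal sketch of a soft-margin SVM with the RBF kernel using scikit-learn's `SVC` is shown below (assuming the arrays from the preprocessing sketch); `C` and `gamma` are the two parameters discussed above.

```python
# Sketch of an RBF-kernel SVM; C = 100 and gamma = 0.9 are the best values
# reported in Section 4.2 for this study's data.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=100, gamma=0.9)
svm.fit(X_tr, y_tr)
print("validation accuracy:", svm.score(X_va, y_va))
```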

CART is a decision tree (DT)-based algorithm that can be used for both classification and regression problems [40]. The data are divided into uniform labels based on the answers (yes/no) to questions about the predictor values through an iterative procedure, and finally a binary tree is generated. If the dependent variable is qualitative, the result is called a classification tree; if it is quantitative, a regression tree. The node containing the entire dataset is called the root node. Starting from the root node, the data are split into left and right branches, and this process is repeated until the estimation error related to the dependent variable is minimized [41]. Because CART is inherently non-parametric, no assumptions are made regarding the underlying distribution of the predictor variables [42]. Therefore, CART can handle numerical data that are highly skewed or multimodal, as well as categorical predictors with either an ordinal or a nonordinal structure. In addition, it identifies the "splitting" variable based on a thorough search over all possibilities; because efficient algorithms are used, CART can search all possible splitting variables even when hundreds of candidate predictors exist. CART is a relatively automated machine learning method because it requires little analyst input relative to the complexity of the analysis.
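A minimal sketch using scikit-learn's `DecisionTreeClassifier`, which implements an optimized version of CART, follows (again assuming the arrays from the preprocessing sketch); `min_samples_leaf` and `min_samples_split` are the two parameters tuned below.

```python
# Sketch of the CART classifier; the parameter values shown are the optimum
# reported in Section 4.2 (min_samples_leaf = 1, min_samples_split = 4).
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier(min_samples_leaf=1, min_samples_split=4,
                              random_state=0)
cart.fit(X_tr, y_tr)
print("validation accuracy:", cart.score(X_va, y_va))
```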

Grid search through 5-fold cross-validation was utilized to improve the performance of the machine learning models and the reliability of the performance evaluation on the validation dataset. In general, the performance of a machine learning model depends on its parameters, and the relevant parameters vary by algorithm. Therefore, to design a model with high accuracy, it is important to set the optimal parameters. In 5-fold cross-validation, the dataset is divided into 5 pieces that are used one by one as a validation dataset while the rest are combined and used as a training dataset; in this way, all of the available data are eventually used for validation. Grid search is an exhaustive search over a defined subset of the hyperparameter space [43]; that is, when creating a model, the user-defined hyperparameter values are evaluated sequentially and the combination with the highest performance is selected. Table 2 shows the parameters and parameter tuning ranges used in each model. For the GNB, the accuracy of the model was predicted by setting the range of variance (var) smoothing from 10<sup>−9</sup> to 1 and increasing the parameter value by approximately 1.23 times at each step, because the accuracy of the model varies depending on var smoothing. The classification accuracy of the kNN depends on the *k* value, which is the number of neighbors, and the accuracy was predicted by increasing the *k* value by 1 from 1 to 100. Because the classification accuracy of the SVM model varies greatly depending on the parameters C and *γ*, the optimal pair of parameters (C: from 10 to 100, *γ*: from 0.1 to 1) was determined by increasing the values by 5 and 0.1, respectively. Finally, in the CART model, the accuracy is determined by the minimum samples leaf (min\_samples\_leaf) and minimum samples split (min\_samples\_split). In this study, the optimal parameters were determined by increasing the minimum samples leaf by 1 from 1 to 10 and the minimum samples split by 1 from 2 to 10.


**Table 2.** Values used in grid search for parameter tuning.
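As an illustration of this procedure, the sketch below runs a grid search with 5-fold cross-validation for the CART model using scikit-learn's `GridSearchCV`, mirroring the ranges in Table 2 (the other models are tuned the same way with their own grids); the arrays are assumed from the preprocessing sketch.

```python
# Sketch of grid search with 5-fold cross-validation for CART; the grids
# mirror Table 2 (min_samples_leaf 1..10, min_samples_split 2..10).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "min_samples_leaf": list(range(1, 11)),   # 1..10, step 1
    "min_samples_split": list(range(2, 11)),  # 2..10, step 1
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)

# Example grid for GNB: np.logspace(-9, 0, 100) steps from 1e-9 to 1 by a
# factor of ~1.23, matching the var smoothing range described above.
gnb_grid = {"var_smoothing": np.logspace(-9, 0, 100)}
```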

#### *3.3. Validation of Machine Learning Models*

The parameters showing the highest learning accuracy of each machine learning model were determined using grid search through 5-fold cross-validation. Subsequently, the performance of each model was verified using the validation dataset (25% of the total data; 1500 records). The performance indicators used to evaluate a model generally depend on the type of supervised learning (regression or classification). In this study, the performance of the models was verified using the accuracy, precision, recall, and F1 score, which are typically used in classification problems. Accuracy refers to the proportion of correct predictions among all predictions, precision refers to the probability that the state is actually positive when a positive prediction is made, recall refers to the probability of correctly predicting an actual positive, and the F1 score is the harmonic mean of precision and recall. The formula for each performance indicator is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{7}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{8}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{9}$$

$$\text{F1 score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{10}$$

where P (positive) and N (negative) denote whether the prediction of the model is positive (yes) or negative (no), and T (true) and F (false) indicate whether the prediction is correct or wrong. Expressed as a matrix, this is called a confusion matrix, as shown in Table 3.

**Table 3.** Confusion matrix of a classifier.
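For completeness, the sketch below computes the confusion matrix and the indicators of Equations (7)-(10) with scikit-learn, assuming the fitted `cart` model and the validation arrays from the earlier sketches.

```python
# Sketch: confusion matrix and the performance indicators of
# Equations (7)-(10) for a fitted model on the validation set.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = cart.predict(X_va)
tn, fp, fn, tp = confusion_matrix(y_va, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("accuracy :", accuracy_score(y_va, y_pred))
print("precision:", precision_score(y_va, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_va, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_va, y_pred))
```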


#### **4. Results**

#### *4.1. Results of Data Preprocessing*

For the training and validation of the machine learning models, 33,435 truck travel time data were collected over 15 weeks using the mine production management system installed in the study area. In addition to the departure and arrival points of each section of the transport route and the truck travel times acquired from the system, additional features, such as the daily average temperature, daily precipitation, and the ore loading status of the trucks, together with labels indicating the abnormal status of the transport routes, were entered. The ore loading status distinguishes between an empty truck and a loaded truck; it was determined by judging whether the truck was headed for a loading point (empty truck) or the crusher (loaded truck) according to the sequence of beacon IDs for each section of the transport route. As a result of judging the transport route status of the data collected during the 15-week period, 3314 of the 33,435 truck travel time data were found to have been measured abnormally.

When all 33,435 data are used as training data, the normal cases far outnumber the abnormal cases, and a biased training result may appear. Therefore, the ratio of data classified as normal to data classified as abnormal was set to 1:1 to prepare the training dataset. A total of 6000 data were prepared by random sampling of 3000 data marked as abnormal and 3000 data marked as normal. Before normalizing the training dataset, the mean, standard deviation, and minimum and maximum values of the truck travel time, daily average temperature, and daily precipitation were calculated (Table 4). Figure 6 shows the histograms of these features. The average truck travel time was 95.55 s and the standard deviation was 74.12 s. The average daily temperature was −2.71 °C and the standard deviation was 5.03 °C. The average daily precipitation was 0.24 mm and the standard deviation was 0.85 mm.

**Table 4.** Features of the dataset for training the machine learning model.

| | Truck Travel Time (s) | Average Daily Temperature (°C) | Daily Precipitation (mm) |
|---|---|---|---|
| Mean | 95.55 | −2.71 | 0.24 |
| Standard deviation | 74.12 | 5.03 | 0.85 |
| Minimum value | 1.00 | −14.30 | 0.00 |
| Maximum value | 299.00 | 11.40 | 6.90 |

**Figure 6.** Feature distribution of data set for machine learning model training: (**a**) truck travel time; (**b**) average daily temperature; (**c**) daily precipitation.

#### *4.2. Results of Model Training and Application*

In this study, GNB, kNN, SVM, and CART models were used to evaluate and diagnose the stability of truck transport routes. To design the most accurate predictive model, the parameter values related to the learning accuracy of each model were optimized. For this purpose, grid search through 5-fold cross-validation was used.

The classification accuracy of the GNB model depends on the parameter var smoothing. The optimal model was determined by setting the parameter value range from 10<sup>−9</sup> to 1 and increasing the parameter value by approximately 1.23 times at each step. Figure 7 is a graph showing the accuracy of the model according to the change in var smoothing. The accuracy of the model decreases rapidly when the parameter value exceeds 10<sup>−2</sup>. The GNB showed the highest learning accuracy (0.60) when the var smoothing value was 0.000188.

**Figure 7.** Variations of training accuracy in the variance smoothing (var\_smoothing) change range of 10−<sup>9</sup> to 1.

The accuracy of the kNN model depends on the value of *k*. Figure 8 shows the accuracy of the model as the *k* value was increased by 1 from 1 to 100. The accuracy of the kNN model was higher for smaller *k* values and was highest at *k* = 1 (0.85).

The SVM model was optimized by changing the values of C and *γ* to determine the parameter values showing the highest training accuracy. The parameter C was set in the range from 10 to 100 and increased by 5, while *γ* was increased by 0.1 from 0.1 to 0.9 to calculate the accuracy of the model. Figure 9 shows the training accuracy of the model according to the change in the C and *γ* values. As the values of C and *γ* increased, the accuracy of the model also tended to increase. In the SVM model, when the C value was set to 100 and the *γ* value was set to 0.9, the model accuracy was the highest at 0.78.


**Figure 8.** Variations of training accuracy in the *k* (n\_neighbors) change range of 1 to 100.


**Figure 9.** Variations of training accuracy in the C change range of 10 to 100 and *γ* change range of 0.1 to 1.

The training accuracy of CART depends on the values of the minimum samples leaf and minimum samples split. In this study, the values of the two parameters were optimized by increasing min\_samples\_leaf by 1 from 1 to 10 and increasing min\_samples\_split by 1 from 2 to 10. Figure 10 shows the training accuracy of the CART model depending on the changes in the two parameter values. When min\_samples\_leaf is 3 or less, the accuracy of the model tends to decrease as min\_samples\_split increases; however, when min\_samples\_leaf was at least 4, the accuracy did not change significantly even when min\_samples\_split was increased. The training accuracy of CART was highest (0.94) when min\_samples\_leaf was set to 1 and min\_samples\_split was set to 4.

The previously determined parameters were applied to each model, and verification was performed. The validation of the machine learning models was performed using the validation dataset (25% of the total data, 1500 records). Tables 5–8 show the model verification results as confusion matrices.

**Figure 10.** Variations of training accuracy in the min\_samples\_split change range of 2 to 10 and min\_samples\_leaf change range of 1 to 10.


**Table 5.** Confusion matrix classified using the Gaussian naïve Bayes (GNB) model.

**Table 6.** Confusion matrix classified using the kNN model.


**Table 7.** Confusion matrix classified using the support vector machine (SVM) model.



**Table 8.** Confusion matrix classified using the CART model.

Table 5 shows the verification results of the GNB as a confusion matrix. There were 687 cases (TN) in which a section whose truck travel time was classified as normal was predicted to be normal. Conversely, there were 242 cases (TP) in which a section classified as abnormal was predicted to be abnormal. In addition, there were 496 cases (FN) of predicting a section with abnormal truck travel time as normal and 75 cases (FP) of predicting a normal section as abnormal. The verification accuracy of the GNB was 0.62; it showed relatively high accuracy (0.90) in predicting data classified as normal as normal, whereas the accuracy of predicting data classified as abnormal as abnormal was very low at 0.33.

In the case of the kNN model, TN and TP, which are cases of correct prediction of the actual data among 1500 verification data, appeared 642 times and 616 times, respectively. FN and FP, which were failed predictions, appeared 122 times and 120 times, respectively (Table 6). The validation accuracy of the kNN model was found to be 0.84, and it showed a similar level of accuracy in all cases.

Table 7 shows the verification results of the SVM model as a confusion matrix. There were 655 cases (TN) in which a section whose truck travel time was classified as normal was predicted to be normal. Conversely, there were 542 cases (TP) in which a section classified as abnormal was predicted to be abnormal. In addition, FN and FP, which were failed predictions, appeared 196 times and 107 times, respectively. The verification accuracy of the SVM model was 0.80; it showed high accuracy (0.86) in classifying data that were actually abnormal, whereas the accuracy in classifying actually normal data was relatively low (0.73).

The verification results of the CART model are shown in Table 8. There were 713 cases (TN) in which a section whose truck travel time was classified as normal was predicted to be normal, and 706 cases (TP) in which a section classified as abnormal was predicted to be abnormal. In addition, FN and FP, which were failed predictions, appeared 32 times and 49 times, respectively. The verification accuracy of the CART model was very high at 0.95, with high accuracy both in classifying the actually normal sections (0.96) and in classifying the abnormal sections (0.94).

The performance of the model was evaluated based on the confusion matrix of each model analyzed using the validation dataset. The performance assessment of the model was conducted using accuracy, precision, recall, and F1 score. Table 9 shows the performance index of each model. The prediction accuracy of the machine learning model was the highest in CART (94.6%), followed by kNN (83.9%), SVM (79.8%), and GNB (61.9%). The CART model also exhibited high precision, recall, and F1 score. Therefore, it can be said that the CART model achieves the best performance in the problem of evaluating the stability of the transport route for each section in the underground mine.


**Table 9.** Performance assessment indicators of machine learning (ML) models.

#### **5. Discussion**

#### *5.1. Analysis of Model Accuracy for Each Transport Route Section*

The prediction accuracy for each section was calculated using the 1500 records used in the verification process of the CART model. Figure 11 shows the accuracy of the model for each section when operating with empty or loaded trucks. The model achieved an accuracy of at least 57.1% over all sections (45 sections) and exhibited an average accuracy of 93.3%. For the routes operated with empty trucks (23 sections), the prediction accuracy of the model was very high, with an average of 90.9%; except for four sections (beacon ID: 1→3, 5→21, 13→15, 16→17), all showed an accuracy of at least 80%. For the routes operated with loaded trucks (22 sections), the prediction accuracy of the model was also very high, with an average of 95.9%, and 20 of the 22 sections showed over 80% accuracy. The accuracy of the model for each section of the transport route tended to be higher as the amount of data for that section in the training dataset increased (Table 10). Therefore, for sections where the prediction accuracy of the model is high, the model can be used to evaluate whether the truck was operated normally in the section; however, for sections where the accuracy is low, the machine learning model should be improved through additional data collection for model training.

#### *5.2. Further Verification of the CART Model Using Unused Data*

The CART model was further verified using the remaining 27,435 data not used to train the model. Table 11 shows the verification results of the CART model as a confusion matrix. There were 26,027 cases (TN) in which a section whose truck travel time was classified as normal was predicted to be normal. Conversely, there were 311 cases (TP) in which a section classified as abnormal was predicted to be abnormal. There were 3 cases (FN) of predicting a section with abnormal truck travel time as normal and 1094 cases (FP) of predicting a normal section as abnormal. The verification accuracy of the CART model on the remaining 27,435 data was 0.96, which is similar to the result (0.95) of the CART model trained and verified with the 6000 data in Table 8. Table 12 shows the performance indicators of the CART model verified using the remaining 27,435 data. The prediction accuracy of the model was 96%, and the precision, recall, and F1 score were 22.1%, 99%, and 36.2%, respectively. In general, in a classification problem using data with little class imbalance, a higher performance indicator implies better model performance [44]. However, when class imbalance exists, the model can still be trusted when the recall is high, even if the precision is low [45]. In the remaining 27,435 data, an imbalance exists because the normal data take up a much larger proportion than the abnormal data. Therefore, considering that the recall is 99%, the CART model can be judged to be reliable.

**Figure 11.** Prediction accuracy for each section of the CART model: (**a**) when operating with an empty truck; (**b**) when operating with a loaded truck.

**Table 10.** Relationship between the prediction accuracy and the average number of data used in machine learning for each section.

| Operation Type | Prediction Accuracy (%) | Number of Sections | Average of the Number of Data Used for Machine Learning for Each Section |
|---|---|---|---|
| Empty haul | 91–100 | 14 | 105.9 |
| | 81–90 | 5 | 90.2 |
| | 71–80 | 3 | 59.3 |
| | 61–70 | 0 | N/A |
| | 57.1–60 | 1 | 26.0 |
| Loaded haul | 91–100 | 19 | 138.0 |
| | 81–90 | 1 | 90.0 |
| | 71–80 | 1 | 48.0 |
| | 66.7–70 | 1 | 27.0 |



**Table 11.** Confusion matrix for further verification of the CART model on the remaining 27,435 data.

**Table 12.** Performance assessment indicators for further verification of CART model on remaining 27,435 data.


#### *5.3. Practical Use at the Underground Mine Site*

The proposed machine learning model can diagnose the operational status of a section by determining whether the truck travel time for that section is normal or abnormal. From the validation of the CART model, one section with high prediction accuracy and one with low prediction accuracy were selected, and the model was used to evaluate whether the truck was operated normally in each section using log data additionally collected from the mine production management system. For this purpose, log data collected during the 16th week (22–27 February 2021) were used for the analysis. The section from beacon ID 11 to 6 and the section from beacon ID 13 to 14 were selected; their accuracies during model validation were 100% and 82%, respectively.

First, in the case of the section from beacon ID 11 to 6, three trucks operated the section a total of 41 times in a week: truck A drove it 1 time, truck B 25 times, and truck C 15 times. The log data for the section showed that the truck travel time was measured within the normal range in 37 operations and within the abnormal range in 4 operations. This section is a transport route for empty trucks toward the loading point, and there is no loading or dumping near the route; therefore, most trucks move through the section without stopping. After converting the log data of the section (beacon ID 11→6) into the input data of the CART model, prediction was performed on whether the truck travel time was measured normally or abnormally. As a result, the truck travel time was predicted to be within the normal range for the actual normal operations, and the abnormal operations were predicted to be abnormal; in other words, the actual data and the prediction results of the CART model were identical. Table 13 shows the prediction results of the CART model for the 16th-week data, classified by the trucks that operated the section, as a confusion matrix. Table 14 is a visualization of the confusion matrix divided by time period. These results mean that, during the period, the trucks operated in line with the existing trend of truck travel times and that there were no problems in the trucks or in the transport section that would affect the truck travel time.


**Table 13.** Confusion matrix classified using the CART model for beacon IDs 11 to 6.

**Table 14.** Prediction result of the CART model by time/truck for beacon IDs 11 to 6.


• TN: Cases in which data that are actually normal are predicted to be normal; TP: Cases in which data that are actually abnormal are predicted to be abnormal.

Next, in the section from beacon ID 13 to 14, two trucks operated a total of 58 times (truck A: 34 times, truck B: 24 times) during the week. In this section, 54 truck travel times were measured within the normal range, and four were measured within the abnormal range. This section is also a transport route along which an empty truck travels to the loading point. However, because a loading point (Area D) is located near the route, variations in truck travel time may occur in this section. When the state of the truck travel time was predicted using the CART model for this section, the prediction accuracy was found to be relatively low (86.2%). Normal data were predicted as normal 46 times (TN), and abnormal data were predicted as abnormal four times (TP). In addition, there were eight cases (FP) in which a normal operation was predicted as abnormal (Table 15). Considering that the validation accuracy of this section had already been shown to be low, it is unsurprising that the prediction accuracy also appears low for the 16th-week data. Table 16 is a visualization of the confusion matrix divided by time period. In this section, some prediction failures of the CART model occur; to improve the accuracy of the model, additional data collection is required for training, and the model needs to be improved. In addition, because some data show abnormal truck travel times, this section needs to be monitored carefully. To improve the overall productivity of the mine and the efficiency of the trucking operation, and to reduce the time required to transport the ore, it is necessary to monitor such sections and respond in advance.


**Table 15.** Confusion matrix classified using the CART model for beacon IDs 13 to 14.

**Table 16.** Prediction result of the CART model by time/truck for beacon IDs 13 to 14.


• TN: Cases in which data that are actually normal are predicted to be normal; TP: Cases in which data that are actually abnormal are predicted to be abnormal; FP: Cases in which data that are actually normal are predicted to be abnormal.
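As a cross-check, the performance indicators for this section can be recomputed directly from the confusion-matrix counts reported above (TN = 46, TP = 4, FP = 8 and, since all abnormal operations were detected, FN = 0); a minimal sketch:

```python
# Confusion-matrix counts for the section from beacon ID 13 to 14 (Table 15).
TN, TP, FP, FN = 46, 4, 8, 0

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 50 / 58, i.e., 86.2%
precision = TP / (TP + FP)                  # 4 / 12, i.e., 33.3%
recall = TP / (TP + FN)                     # 4 / 4, i.e., 100%

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```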

If abnormal truck travel times are observed only for a specific truck, possible causes such as insufficient driver skill, poor truck maintenance, or an overdue maintenance period should be suspected, and the manager of the mine should take appropriate action in this regard. As demonstrated above, the proposed CART model can predict the status of the truck travel time and thereby detect, in advance, problems (or the possibility of their occurrence) in the transport route or equipment. In addition, it can help to analyze the causes and prepare countermeasures. Therefore, the CART model can be used as a tool for mine managers to improve the productivity and efficiency of transport operations.

#### *5.4. Comparison between the Existing and Machine Learning-Based Methods*

Various researchers are using machine learning techniques to monitor and diagnose mine operating systems, equipment, and facilities. However, hitherto, no research case has been reported on monitoring and diagnosing the condition of a mine transport system using machine learning techniques. As a similar research case related to diagnosing and predicting the status of transport routes using truck travel time data, Park and Choi [24] evaluated the stability and classified the types of transport routes using the statistics of the truck travel time for each section of the transport route. The method of collecting log data related to truck travel time is the same as that used in this study. However, Park and Choi [24] used percentiles (P10, P90) of the truck travel time to evaluate the stability and condition of each section of the transport route. That is, if the newly collected truck travel time was measured in the range between percentiles P10 and P90, the status of the transport route was classified as normal; otherwise, it was classified as abnormal. The truck travel time of the mine may vary depending on the production plan, vehicle dispatch plan, tunnel maintenance and repair status, season (temperature), precipitation, and driver's driving skill. In this study, the truck travel time for each section was evaluated by considering the beacon IDs (origin and destination), temperature, precipitation, and whether the truck was loaded in addition to the transport time, and then the status of the transport route was diagnosed. Therefore, the proposed method is considered to have a higher level of reliability than the existing method, which evaluates the stability and condition of each section of the transport route using only the statistics of the truck travel time.
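For comparison, the percentile rule of Park and Choi [24] can be stated in a few lines. The sketch below assumes `history` holds previously collected travel times for one section and `t_new` is a newly collected measurement (all values hypothetical):

```python
import numpy as np

# Hypothetical travel times (s) previously collected for one section.
history = np.array([112, 118, 121, 125, 130, 134, 140, 152, 160, 171])
t_new = 165.0  # newly collected travel time (s)

p10, p90 = np.percentile(history, [10, 90])

# Normal if the new travel time falls between the P10 and P90 percentiles.
status = "normal" if p10 <= t_new <= p90 else "abnormal"
print(f"P10={p10:.1f}, P90={p90:.1f} -> {status}")
```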

#### **6. Conclusions**

In this study, we proposed a method that utilizes log data related to the truck travel time and machine learning models (GNB, kNN, SVM, and CART) to evaluate the stability of the underground mine transport route and to diagnose its operation status. To this end, a limestone mine that collects truck travel time data in underground workings using Bluetooth beacons and tablet computers was selected as the study area, and truck travel time data were collected for a certain period. Training and validation of the models were performed using the collected data, and the results of monitoring and diagnosing the transport route status in the study area were presented. As a result of performing a grid search through 5-fold cross-validation using the training dataset, the accuracy (94.1%) was highest when the parameters min\_samples\_leaf and min\_samples\_split of the CART model were set to 1 and 4, respectively. In the validation of the CART model performed using the validation dataset (1500 data), data with normal truck travel times were predicted as normal 713 times, and abnormal data were predicted as abnormal 706 times. The performance of the machine learning model was judged using accuracy, precision, recall, and F1 score. The accuracy of the CART model was 94.6%, the precision and recall were 93.5% and 95.7%, respectively, and the F1 score was also high at 94.6%.
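The tuning step summarized above corresponds to a standard scikit-learn grid search; the following sketch reproduces its structure under the assumption that the training dataset is available in a file named `training_data.csv` with a `label` column (both hypothetical), with the parameter grid covering the reported optimum (1 and 4):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file holding the training dataset used in this study.
data = pd.read_csv("training_data.csv")
X_train, y_train = data.drop(columns="label"), data["label"]

param_grid = {
    "min_samples_leaf": [1, 2, 4, 8],
    "min_samples_split": [2, 4, 8, 16],
}

# Grid search with 5-fold cross-validation over the CART parameters.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)  # e.g., {'min_samples_leaf': 1, 'min_samples_split': 4}
print(search.best_score_)   # cross-validated accuracy (94.1% in this study)
```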

The proposed CART model can be used for monitoring and diagnosing the status of the transport routes that constitute the truck transport system in the underground mine. In addition, it can serve as a practical tool to improve the productivity and efficiency of mine transport operations. Because the truck travel time for each section varies depending on the driver's driving skill, tunnel maintenance and repair status, vehicle dispatch plan, etc., the truck travel time has a significant impact on the efficiency and productivity of truck transport operations. Therefore, it is crucial to identify the sections in which the truck transport operation is expected to be abnormal and to prevent problems from occurring in those sections or vehicles in the future. The proposed CART model showed an average prediction accuracy of 94.1% for all sections of the study area. This means that the stability of the transport route can be evaluated and diagnosed with a relatively high level of reliability by judging whether newly collected truck travel time data fall within the normal or the abnormal range. However, the prediction accuracy was relatively low in some sections. To improve the prediction accuracy for these sections, additional truck travel time data must be collected for training the machine learning model, and the model will have to be improved accordingly.

In this study, it was confirmed that machine learning techniques can be used to diagnose and predict the condition of transport routes in order to maintain equipment and workplaces in underground mines, and a method for diagnosing and predicting the state of mine transport systems using machine learning techniques was proposed. We expect that, from a methodological perspective, the proposed method can be applied not only in underground mines but also in open-pit mines.

**Author Contributions:** Conceptualization, Y.C.; methodology, Y.C. and H.N.; software, D.J.; validation, S.P.; formal analysis, S.P. and D.J.; investigation, Y.C.; resources, Y.C.; data curation, S.P. and D.J.; writing—original draft preparation, S.P. and D.J.; writing—review and editing, Y.C.; visualization, S.P.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Energy & Mineral Resources Development Association of Korea (EMRD) grant funded by the Korea government (MOTIE) (Training Program for Specialists in Smart Mining).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data sharing not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Article* **Modeling Mine Workforce Fatigue: Finding Leading Indicators of Fatigue in Operational Data Sets**

**Elaheh Talebi <sup>1,</sup>\*, W. Pratt Rogers <sup>1</sup>, Tyler Morgan <sup>2</sup> and Frank A. Drews <sup>3</sup>**


**Abstract:** Mine workers operate heavy equipment while experiencing varying psychological and physiological impacts caused by fatigue. These impacts vary in scope and severity across operators and unique mine operations. Previous studies show the impact of fatigue on individuals, raising substantial concerns about the safety of operation. Unfortunately, while data exist to illustrate the risks, the mechanisms and complex pattern of contributors to fatigue are not understood sufficiently, illustrating the need for new methods to model and manage the severity of fatigue's impact on performance and safety. Modern technology and computational intelligence can provide tools to improve practitioners' understanding of workforce fatigue. Many mines have invested in fatigue monitoring technology (PERCLOS, EEG caps, etc.) as a part of their health and safety control system. Unfortunately, these systems provide "lagging indicators" of fatigue and, in many instances, only provide fatigue alerts too late in the worker fatigue cycle. Thus, the following question arises: can other operational technology systems provide leading indicators that managers and front-line supervisors can use to help their operators to cope with fatigue levels? This paper explores common data sets available at most modern mines and how these operational data sets can be used to model fatigue. The available data sets include operational, health and safety, equipment health, fatigue monitoring and weather data. A machine learning (ML) algorithm is presented as a tool to process and model complex issues such as fatigue. Thus, ML is used in this study to identify potential leading indicators that can help management to make better decisions. Initial findings confirm existing knowledge tying fatigue to time of day and hours worked. These are the first generation of models and future models will be forthcoming.

**Keywords:** machine learning; mine worker fatigue; random forest model; health and safety management

#### **1. Introduction**

Heavy industries such as mining, which require rotational shift schedules of their personnel, are exposed to fatigue risk. This risk manifests itself in health and safety dangers presented by fatigued individuals operating heavy equipment. Fatigue is often a contributing factor to many health and safety incidents in mines, but, in addition, fatigue can also affect cognition adversely, with a negative impact on the operational performance of mine sites. These risks need improved modeling, which can enable a better understanding and better management. Improved models can eventually lead to more progressive and dynamic fatigue management with a positive impact on operational safety and efficiency.

Bauerle et al. (2018) recently discussed the limitations and lack of studies on fatigue in the mining industry [1,2]. However, several devices and technologies have been developed to identify and reduce fatigue-related risk. These tools are appealing as a risk control approach that monitors behavioral and task performance indicators that potentially indicate increases in fatigue risk [3]. Moreover, in mine operations, many real-time operational data sets exist and have great potential to provide far more analytical insight to model future undesirable events such as fatigue.

**Citation:** Talebi, E.; Rogers, W.P.; Morgan, T.; Drews, F.A. Modeling Mine Workforce Fatigue: Finding Leading Indicators of Fatigue in Operational Data Sets. *Minerals* **2021**, *11*, 621. https://doi.org/10.3390/ min11060621

Academic Editors: David Cliff and Saiied Aminossadati

Received: 15 April 2021 Accepted: 2 June 2021 Published: 10 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

This paper presents a method that uses operational data sets to model workers' fatigue. The goal is to better understand the factors, tracked in operational technology systems, which could be used as predictors for fatigue events. The primary questions of this paper are: (1) Are there indicators within operational and other common data sets at mines that can be used to model fatigue events? (2) When these data sets are integrated and analyzed on common dimensions, is there potential value in analyzing the data with advanced computational tools such as machine learning algorithms? The approach presented in this paper is different from previous studies of mining fatigue because we use a machine learning model to identify predictor elements of workers' fatigue. The proposed model and future iterations may be useful in identifying environmental, operational and managerial events that lead to fatigue events in mine workers. This approach, when fully developed, has the potential to enhance safety and health management systems by quantifying areas of managerial focus.

The first step of the data analysis is assessing the preliminary relationships of the data. Based on the literature, there are some hypotheses around potential variables affecting fatigue in operators, which are tested in the initial data analysis section. First, does the average production or operational patterns of the mine influence the number of fatigue events? Is there any relation between time, week, month or year and the number of fatigue events? What are the differences between night and day shifts in terms of fatigue? Can the distribution of the fatigue events by shift and hour give us insights into the fatigue events? Lastly, are there any variables from weather data that cause a higher number of fatigue events?

#### **2. Literature Review**

Fitness for duty in mining is influenced by an individual's physical and psychological fitness, such as drug- and alcohol-induced impairment, fatigue, physical fitness, health and emotional wellbeing, including stress. Among these factors, fatigue is a strong driver of fitness for duty in mining, and it is significantly caused by excessive work hours or insufficient rest periods associated with shiftwork [4,5]. Hence, while fatigue is identified as an issue that mine sites must address, studying the factors that contribute to or ameliorate fatigue is important. Fatigue in the workplace often results in a reduction in worker performance. Fatigue must be controlled and managed since it causes significant short-term and long-term risks. In the short term, fatigue can result in reduced performance, diminished productivity, human error and deficits in work quality. These effects might result in lower levels of alertness, coordination, judgment, motivation and job satisfaction, which in turn lead to severe health and safety issues, including accidents and injuries [6–9]. Fatigue can also cause long-term negative health implications. These outcomes include future mental and physical morbidity, mortality, occupational accidents, work disability, excess absenteeism, unemployment, reduced quality of life and disruptive effects on social relationships and activities [10,11].

Based on a study by Drews et al. (2020), fatigue in the mining industry is different from that in other industries due to mining-specific environmental factors. Some of these factors are repetitive and monotonous tasks, involving long work hours, shiftwork, sleep deprivation, dim lighting, limited visual acuity, hot temperatures and loud noise [2]. However, Drews et al. (2020) also mention the high monotony of equipment operation in mining haulage as a key contributor to fatigue. Various psychological and physiological issues affect worker fatigue, which makes fatigue measurement and management difficult. Drews et al. (2020) extended a conceptual model of fatigue, adding sleep efficiency to a previously proposed model of fatigue [2]. This model shows that distal and proximal factors have effects on fatigue, including clinical factors such as life events and stressors, personality factors, previous shift conditions and sleep efficiency. Their study was based on data collected with haulage operator focus groups. Participants discussed factors that contributed to their fatigue, such as diet, shift schedule, travel time to work, sleep amount and quality, domestic factors, physical fitness and the presence of sickness. Another finding from the study is that operators have a clear awareness of fatigue's impacts on their performance and of how to reduce the impact through nutrition, physical fitness, etc. [2]. Even considering this, other studies show that there is no clear approach for health and safety management to control, monitor and mitigate worker fatigue during mine operations [2]. Some technologies can monitor drivers of fatigue, such as tracking eye movement and head orientation (PERCLOS monitoring system) or hard hats with electroencephalogram (EEG) activity tracking. Each of these technologies has its advantages and disadvantages [2]. Each can detect fatigue when it occurs; however, these systems do not necessarily prevent or mitigate fatigue [2]. Moreover, users of these technologies, such as the PERCLOS system, expressed privacy concerns regarding the system's constant monitoring and mentioned a high number of false alarms from the equipment, thus being a nuisance [2]. Bauerle et al. (2018) mentioned that, despite the complexity and uncertainty regarding the fatigue of miners, some real solutions could be developed for improving fatigue-related issues with fatigue assessment interventions, looking beyond sleep, physical work and shift work effects [1]. In the same vein, Drews et al. emphasized that health and safety management should take a socio-technical systems perspective, since a sole focus on technological solutions may create an illusion of safety while not necessarily improving safety performance. Moreover, these approaches require user acceptance and high levels of trust in order not to have an adverse impact on their functionality [2]. Successfully modeling fatigue will require a multi-faceted approach and a variety of data inputs from the mining system.

In addition to the health and safety implications on workers, fatigue can result in damage or loss of expensive mine equipment such as haul trucks. Therefore, the mining industry has long focused on measuring operational risk losses for the purpose of capital allocation and the process of managing operational risks. Operational risk results from insufficient or failed internal processes, people, control, systems or external events, including equipment health, individual health and safety and worker fatigue [12]. To manage the health of equipment, organizations have deployed early warning systems through equipment monitoring and modeling technology. These technologies depend on understanding either machine design or empirical modeling methods to determine normal equipment behavior and detect any signs of abnormal behavior [13]. These technologies learn the dynamic operational behavior of equipment using equipment sensor data and create a predictive model. The predictive model output, which is the equipment's performance, will be compared with actual measurements from sensor signals to detect any abnormalities or failures [13].

The entire mine workplace could benefit from new technologies to collect and analyze real-time safety data such as fatigue monitoring data. A critical issue is the ability to use this information to react prior to an incident. The development of new technologies can assist safety managers in providing timely measures to predict an increase in risk, resulting in the prevention of serious incidents [14]. To manage operational safety and health in mines, it is necessary to have safety indicators. There are two different types of safety indicators: lagging and leading indicators [15]. Lagging indicators evaluate safety and health using incident and illness rates, while leading indicators measure workplace activities, conditions and safety and health-related events [16]. In the case of fatigue, lagging indicators are evident after fatigue has occurred, while leading indicators are measurements that could prevent fatigue, such as sleep patterns or caffeine intake, and steps that help to lower fatigue before it becomes severe. Since lagging indicators have a reactive and delayed nature, managers need to develop appropriate leading indicators to measure workplace safety and health risk [16]. Leading indicators have a predictive value regarding unsafe workplace conditions or behavior that is followed by an incident [17–20]. There are three main uses of leading indicators: monitoring the level of safety, deciding where and how to take action and motivating managers to take action [21,22]. Passive leading indicators (PLIs) are measurements that can provide an indication of the probable safety performance [14]. On the other hand, active leading indicators (ALIs) are dynamic and more subject to active change in a short period of time [14,23]. To have predictive value, ALIs must be recorded in a timely manner in order to obtain accurate measurements and observations.

ALIs are continually being advanced as new technology is introduced into production systems. The internet of things (IoT), big data, artificial intelligence (AI) and machine learning (ML) are being used to enhance the safety, efficiency and quality of operations [24–26]. In high-risk environments such as mines, IoT can be used to improve safety and decrease the probability of human error and disasters [24–26]. In addition, IoT can be a relatively inexpensive and effective approach for hazard recognition and sending safety notifications [14].

Machine learning (ML) has been demonstrated to be a predictive tool to support management to make better decisions [16,27]. In spite of the abundant leading indicators, the use of ML to predict leading indicators is rare [16,27]. ML is flexible to operate, without any statistical assumptions, and has the ability to identify both linear and non-linear relationships within the phenomenon investigated [16,24,28]. Poh et al. (2018) used ML to predict safety leading indicators on construction sites [16]. They used a data set that was collected from a construction contractor to identify the input variables and develop a random forest (RF) model to forecast the safety performance of the project [16]. They mentioned that the occurrence and severity of incidents is not random, which means that there is a pattern describing the incidents, and they can be predictable [16]. This pattern can be used to explain the complexity of the leading indicators and long-term data collection helps to elucidate the interactions of safety indicators over time [16,29].

The literature suggests that finding leading indicators to predict fatigue in the mining industry is necessary [2]. Due to the complexity of fatigue, applying computational intelligence methods such as machine learning (ML) algorithms on the real-time data captured from current and future IoT technologies can benefit mine operations in modeling fatigue. Such a model could identify ALIs and predictive elements of workers' fatigue. Poh et al. collected data sets for the purpose of modeling safety. However, their study was limited to only safety data, likely neglecting other possible predictive factors. A comprehensive study incorporating a wider range of data sets will extend possible independent features in the model to identify the best predictive factors. If these factors can be developed as leading indicators of fatigue, enhanced safety and health decisions can be made earlier in the fatigue cycle.

#### **3. Methodology**

#### *3.1. Data Set Characterization*

The presented study uses 3.5 years of data at a single, large, operating surface mine. Table 1 provides an overview of the data sets, the types of information encoded in the data and the range of dates covered by each data set.


**Table 1.** Data sets' details.

The site utilizes a PERCLOS monitoring system, which has been in place since 2014. This system uses cameras to track, monitor and model the eye movements of haul truck operators [2]. The system detects certain eye movements and can determine whether the eyes are closed or blinking rapidly, along with other indicators of fatigue. If the system cannot detect that the operator's eyes are open for more than 3 s, it alerts the operator using seat buzzes and vibration. In addition to a local alarm, the system also sends a message or alarm to the dispatcher, the supervisor and the company supporting the system.

Data captured from the system are categorized based on the type of event: micro-sleep with a stable head posture, other eye closure (drowsiness), eyewear interference (clear lenses), eyewear interference (sunglasses), normal driving, bad tracking, glance down, glance away, driver leaning forward, camera covered, testing, IR pods covered, no driving, video error, seat position change, partial distraction and other. Based on the study by Drews et al. (2020), micro-sleep and drowsiness are signs of operator fatigue [2]. The study assumes that the PERCLOS system is functioning and properly calibrated. Much work has been done establishing the PERCLOS technology, and testing the viability of this technology is beyond the scope of this paper. The literature shows that the fatigue events captured by these systems are important indicators of fatigue [1,2]. Therefore, for the purpose of our study, micro-sleeps and drowsiness have been used to identify a fatigue event, and other types of alarms are assumed to be system errors or the result of negative behaviors such as distracted driving. These are labeled in the PERCLOS system as "other eye closure (drowsiness)" and "micro-sleep with stable head". The operational difference between these two categories is whether the operator has a stable head posture at the time of fatigue. When the operator's head is moving downwards, the fatigue event is labeled "other eye closure (drowsiness)"; when the operator has a stable head posture, it is labeled "micro-sleep with stable head".
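In practice, this labeling step amounts to mapping the raw PERCLOS event categories onto a binary fatigue flag. A minimal pandas sketch follows; the file name and column names are hypothetical, and the category strings are assumed to match those listed above:

```python
import pandas as pd

# One row per PERCLOS alert (hypothetical file and columns).
events = pd.read_csv("perclos_events.csv")

# Only these two categories are treated as fatigue events in this study; all
# other alert types are regarded as system errors or negative behaviors.
FATIGUE_TYPES = {"micro-sleep with stable head", "other eye closure (drowsiness)"}

events["is_fatigue"] = events["event_type"].str.lower().isin(FATIGUE_TYPES)
fatigue_events = events[events["is_fatigue"]]
print(fatigue_events["event_type"].value_counts())
```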

More details of the fatigue events are shown in Table 2. The average number of events per day and the number of days with these fatigue events are provided for comparison. The data show more drowsiness compared to micro-sleep, with drowsiness representing 60% of the fatigue events captured by the system. The percentage of days with fatigue shows that on 98% and 99% of the days there was at least one micro-sleep and drowsiness fatigue event, respectively. Therefore, fatigue is a critical daily hazard for those working in mines.

**Table 2.** Count of days and percentage of total days by fatigue type.

| Fatigue Event Type | Average Number of Events per Day | Days with Fatigue Events | Percentage of Days with Fatigue | Percentage of Fatigue Events |
| --- | --- | --- | --- | --- |
| Micro-Sleep with Stable Head | 13 | 1313 | 98% | 40% |
| Other Eye Closure (Drowsiness) | 20 | 1327 | 99% | 60% |

The surface mine maintains a fleet management system (FMS), which tracks the production and status of equipment. The FMS data are made available in a business intelligence (BI) database. Status event data provide details on the state of an asset. Status event coding can be used to determine if a piece of equipment is down for maintenance, in a production activity or in standby mode. This information is valuable to compare against event rates, as well as show breaks and delays. Other information in the BI database includes the load cycle data. A production cycle shows the load of a shovel or truck. Detailed steps within a load, such as loading, dumping, running empty, running loaded, etc., are shown. The most important data for this study are the production rate by shift/hour, which can be used to normalize the data as well as understand the activity levels of haul truck drivers.

Time and attendance data are provided via the hours worked by hourly employees. The mine uses a swipe-in/swipe-out time keeping system, the data from which are processed and loaded into a time and attendance database. The data set was used to measure shifts and hours consecutively worked by haul truck operators.

Mobile machinery such as haul trucks generates large amounts of equipment health data. The data are produced by hundreds of sensors and are used to track the location, production cycles, equipment status and equipment health alarms. The sensors can be valuable predictors of production achievements and operator behavior. The surface mine utilizes an equipment health database to capture and model the health and use of their large capital assets. These databases track in detail how a given piece of equipment is being operated at any given time. The sensors can detect if an operator is operating outside of the safe boundaries of the machine and create an alarm. These alarms vary by severity and location and generate massive amounts of data.

Lastly, weather data are gathered from a local weather station in the mine. This data set includes information for the weather at the mine site. Over 10 variables are captured at 10 min intervals. Each interval contains information regarding temperature, temperature change, wind speed, precipitation and air pressure.

#### Data Pre-Processing

In this step, data need to be pre-processed to make them appropriate for the application of the chosen modeling approach. Initial data analyses are performed to identify possible patterns of data with the identified fatigue events. This analysis informs the next modeling step by identifying an appropriate approach to predict fatigue events with the data sets.

Fatigue data provided by the fatigue monitoring system were reviewed and divided into different categories. Among them, drowsiness and micro-sleeps were identified as the fatigue events occurring among workers, so they are considered the dependent variables of the model. All other data, including weather, production cycles, equipment health alarms and time and attendance data, are modeled as predictor variables.

Each data set had to be cleaned and missing data removed prior to input to the model. The process of cleaning data entails removing incorrect, duplicate, incomplete and corrupted data. Updating data types is also a common cleaning activity. A list of all variables used in the model is given in Table 3. After all data engineering, data are prepared for two distinct models: shift-based and hourly-based models. Data sets were thus grouped by shift ID and hour of data time.
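A sketch of this pre-processing, under the assumption that the merged data have been exported to a single file with `shift_id` and `datetime` columns (all names hypothetical):

```python
import pandas as pd

df = pd.read_csv("merged_data.csv", parse_dates=["datetime"])

# Cleaning: drop duplicate and incomplete records, fix data types.
df = df.drop_duplicates().dropna()
df["shift_id"] = df["shift_id"].astype(int)

# Aggregate the numeric variables two ways, mirroring the two models:
num = df.select_dtypes("number")
by_shift = num.groupby(df["shift_id"]).agg(["mean", "sum", "min", "max"])               # shift-based
by_hour = num.groupby(df["datetime"].dt.floor("h")).agg(["mean", "sum", "min", "max"])  # hourly-based
```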

#### *3.2. Initial Data Analysis*

As stated above, the primary questions posed by this study are: Are there new indicators within existing mining data sets that can be used to model fatigue events? In addition, what are potential patterns when these data sets are analyzed? In this section, the available data sets are presented to explore how they can be used to test the hypothesis of the research. Modern machine learning approaches require various levels of data engineering to facilitate statistical analysis. This section presents the process and logic used to identify key variables and the direction for further data engineering used in the development of the ML model. More specifically, the analyses presented here cover the distribution of the fatigue events, average production compared to fatigue events, number of fatigue events during night and day shifts and temperature versus fatigue events.

Fatigue is first examined by analyzing its frequency distribution by shift, which suggests a non-normal distribution, as illustrated in Figure 1. This figure visualizes the distribution of fatigue events per shift, which is seemingly close to a Poisson distribution, with a mean of approximately 17 events per shift. Calculation of the probability of having 0 and >52 events per shift shows, respectively, very low probabilities of *p* = 0.013 and *p* = 0.0097. However, the probability of having 7–8 events per shift, which is the mode of the distribution, is estimated to be *p* = 0.052. The next question is why some shifts have a higher number of fatigue events compared to other shifts. Therefore, to find the potential variables that drive this difference, data aggregated by shift are included in the model.
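Such probabilities can be estimated directly from the empirical distribution of per-shift counts. In the sketch below, a synthetic right-skewed sample (mean of roughly 17 events per shift, one value per shift ID) stands in for the site data, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the observed per-shift fatigue counts:
# right-skewed, mean ~17 events per shift, 4140 shifts.
counts = rng.negative_binomial(2, 2 / 19, size=4140)

print("mean events/shift:", counts.mean())
print("P(0 events)      :", np.mean(counts == 0))
print("P(>52 events)    :", np.mean(counts > 52))
print("P(7-8 events)    :", np.mean((counts >= 7) & (counts <= 8)))
```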


**Table 3.** List of the variables based on the data source.

| Data Source | Variables | Data Type and Example Data |
| --- | --- | --- |
| | Shift ID | Integer (1 to 4140) |
| | Day of year | Integer (1 to 365) |
| **Fatigue monitoring system** | Drowsiness and Micro-Sleep Fatigue Events Count (Normalized) | Float (0 to 1) |
| | Mean Load Capacity Percentage (broken down by fleet, creating 8 variables) | Float (0 to 1) |
| | Mine Load Capacity Percentage | Float (0 to 1) |
| | Mean Loaded Travel Distance | Float (3735.2 to 13,711.66) |
| | Mean Loaded Travel Lift | Float (272.25 to 1083.29) |
| | St Dev Loaded Travel Distance | Float (604.55 to 16,118.27) |
| **Weather** | Mean Barometric Pressure | Float (0 to 25.1) |
| | Mean Precipitation | Float (0 to 439.1) |
| | Mean Temperature (2 m) | Float (−6.8 to 34.8) |
| | Min Barometric Pressure | Float (0 to 25.01) |
| | Min Precipitation | Float (0 to 29.71) |
| | Max Precipitation | Float (0 to 756.9) |
| | Max Temperature (2 m) | Float (−4.435 to 36.82) |
| | Sum Precipitation | Float (0 to 5269.17) |
| **Equipment health alarms and events** | Normal Alarm Count | Integer (0 to 121) |
| | Operational Alarm Count | Integer (0 to 819) |
| | Undetermined Alarm Count | Integer (0 to 1282) |
| | Both Alarm Count | Integer (0 to 632) |
| | Electrical Alarm Count | Integer (0 to 892) |
| | Lockout Alarm Count | Integer (0 to 35) |
| | Maintenance Alarm Count | Integer (0 to 1094) |
| | Mechanical Alarm Count | Integer (0 to 1753) |
| | None Alarm Count | Integer (0 to 2608) |
| | Scheduled Down Count | Integer (0 to 85) |
| | Unscheduled Down Count | Integer (0 to 141) |
| | Operational Delay Count | Integer (0 to 1126) |
| | Operational Down Count | Integer (0 to 80) |
| | Ready Non-Production Count | Integer (0 to 977) |
| | Ready Production Count | Integer (0 to 1322) |

In order to analyze the effect of shift time on fatigue, Figure 2 shows the average hourly production and the hourly number of fatigue events per person (including drowsiness and micro-sleep). Shift change times (7 am/pm) are indicated by substantial reductions in fatigue events due to the relatively high levels of activity associated with shift changes. In addition, the results illustrate that fatigue counts increase from the beginning of a night shift until the shift end; however, during day shifts, the fatigue levels of the operators peak at around 1 pm. Regarding the relationship between the number of fatigue events and hourly production, the findings suggest no clear relationship. Figure 2 suggests that the time of day and shift type could be included as additional variables in the model. This figure also suggests a negative relationship between production and fatigue. Production rates, disruptions and aggregate levels, to a certain extent, affect the operational behavior of the site. A higher number of cycles or longer cycles have the potential to influence how engaged operators are, which could provide an interesting additional measure to predict fatigue. Information about production cycles and delays will be modeled against fatigue to further explore this potential relationship.

**Figure 2.** Hourly fatigue events and average hourly production.

To illustrate the relationship between hourly data and the frequency of fatigue events, their distribution is provided in Figure 3. This right-skewed distribution shows that more than 50% of the hours contain at least one fatigue event. This suggests that further exploration is needed to identify the variables contributing to the range of hourly fatigue events. Therefore, a second model with hourly aggregated data is developed, which will be introduced in the model section. In addition, Figure 4a shows that night shifts contain significantly more events compared to day shifts. Moreover, the average event counts by month indicate a seasonality effect, with lower rates of fatigue in spring and higher rates in summer and winter (Figure 4b). To summarize, the above explorations demonstrate that some variables, such as shift type, time of day and worked hours, have effects on fatigue. At the same time, the findings suggest that advanced approaches will be required to model fatigue events.


**Figure 4.** (**a**) Fatigue events per shift; (**b**) Average monthly fatigue events.

Next, we conduct an exploration of the influence of environmental variables on operator fatigue. Figure 5 illustrates the monthly average ambient temperature and monthly fatigue events per person, without any clear pattern. Thus, there appears to be no obvious correlation between temperature and fatigue events in this plot. Therefore, for further exploration, weather data are added as independent variables to the model.

**Figure 5.** Monthly fatigue events per person vs. average temperature.

The main purpose of the above analyses was to explore relationships between fatigue events and variables contained in the existing data sets. From our initial data analyses, fatigue appears to have some relationship with variables such as weather, shift type, time of day, etc. These analyses introduce more variables for the purpose of the modeling and data aggregating methods; the full list of variables is shown in Table 3. However, these preliminary analyses are not able to identify a pattern of fatigue based on these variables, although they do provide critical insight into the data. The literature shows that fatigue is a complex issue and that different psychological and physiological variables influence fatigue in workers [2]. Considering the limitations of the above analytical approaches, we use machine learning (ML) approaches as an alternative to explore the data set and elucidate relationships that are not easily identifiable. Because the above analyses show that shift type and hour of day appear to have significant effects on the fatigue of haul truck drivers, data were aggregated by shift and by hour to create two different models. One approach involves fatigue prediction using the shift-based data, and the other uses hourly data to predict fatigue. The next section presents the modeling approach.

#### *3.3. Machine Learning Model*

Figure 6 presents the procedure and methods of the modeling steps involved in the development of the machine learning model. The process involved the following steps:

**Figure 6.** Procedure diagram for the prediction of fatigue using random forest regression algorithm.

#### 3.3.1. Random Forest Regression Algorithm

The machine learning model selected for this analysis is a random forest (RF) regression algorithm. Random forest algorithms were chosen for their tendency to generalize well to a wide variety of problems, their rapid speed of training and because they are a key feature of many well-known machine learning solutions. Another key benefit of using random forests is the tooling that has been built in recent years to help researchers to gain insights into what has long been thought of as the black box of machine learning. These new analytical tools allow researchers to see the features that the model relies upon the most in order to make predictions and determine how marginal changes in these features impact the predicted outcomes [30,31].

When data are not linearly scattered, a regression tree, which is a type of decision tree, can be used. In this type of decision tree, each leaf presents a threshold value (TV) for each feature of the model. For the purpose of finding the best decision tree, the model tries to find the best threshold value for each feature (independent variable) by finding the minimum sum of squared residuals (SSR). The SSR is the sum of the squared differences between each predicted value and the actual value (Figure 7). For models with more than one feature, the decision tree root is the feature with the lowest SSR. Figure 8 represents an example of a random forest regression decision tree with five features.

**Figure 7.** Finding TV and SSR for random data (*e* or error is the difference between each predicted value (average value) and the actual value, and *i* is the feature number).
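Written out in the notation of Figure 7, the quantity minimized when scanning candidate threshold values can be stated as follows (the left/right subscripts, introduced here for clarity, denote the average target value on either side of the candidate threshold):

$$\mathrm{SSR}(TV) = \sum_{i} e_i^{2} = \sum_{x_i \leq TV} \left( y_i - \bar{y}_{\mathrm{left}} \right)^{2} + \sum_{x_i > TV} \left( y_i - \bar{y}_{\mathrm{right}} \right)^{2}$$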


**Figure 8.** Schematic random forest regression decision tree.

A random forest regression algorithm is an ensemble of randomized regression trees. The random forest algorithm creates bootstrap samples from the original data. Bootstrapping is a procedure that resamples a single data set to create many simulated samples. For each of the bootstrap samples, the algorithm grows a classification or regression tree. This algorithm chooses a random sample of the predictors and selects the best split among those variables. It then predicts new data by aggregating the predictions of the trees. Models can estimate the error rate based on the training data at each bootstrap iteration [30,31].
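A minimal illustration of this bagging procedure, built from scikit-learn decision trees on toy data rather than the study's own data or implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                        # toy feature matrix
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)   # toy non-linear target

trees = []
for _ in range(100):
    # Bootstrap sample: draw rows with replacement from the original data.
    idx = rng.integers(0, len(X), size=len(X))
    # A random subset of predictors is considered at each split.
    tree = DecisionTreeRegressor(max_features="sqrt")
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Predict new data by aggregating (averaging) the predictions of all trees.
X_new = rng.normal(size=(3, 5))
y_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
print(y_pred)
```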

#### 3.3.2. Model

Two models were created using available data subsets as dependent variables. For the shift-based model, the dependent variable was the number of fatigue events in a 12 h shift, which was normalized to scheduled haulage hours in the shift (labor hours). For the hourly-based model, the dependent variable was the number of fatigue events in an hour, which was normalized to scheduled haulage hours (labor hours). All independent variables in these models, also known as features, are representations of the mine's operation as represented in the data sets. These features contain values such as the average production, average temperature and equipment alarm (see Table 3).
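Constructing the dependent variable for the shift-based model then reduces to a join and a division; a sketch with hypothetical frame names, column names and toy values:

```python
import pandas as pd

# Hypothetical per-shift aggregates (toy values for illustration).
fatigue_by_shift = pd.DataFrame({"shift_id": [1, 2, 3], "event_count": [14, 22, 9]})
labor_by_shift = pd.DataFrame({"shift_id": [1, 2, 3], "haulage_hours": [310.0, 295.5, 320.0]})

merged = fatigue_by_shift.merge(labor_by_shift, on="shift_id")
# Dependent variable: fatigue events normalized to scheduled labor hours.
merged["fatigue_per_labor_hour"] = merged["event_count"] / merged["haulage_hours"]
print(merged)
```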

For this model, data were divided into two sets: 80% constituted the training data set and 20% constituted the validation data set. The goal of these models was to determine the features that can predict fatigue in such a way that minimizes *RMSE*. In these models, only data subsets with micro-sleeps and drowsiness containing fatigue were modeled. From 151,432 possible events, only 44,953 contained micro-sleep and drowsiness in the data sets to train and validate the random forest algorithm. After exploratory data analysis, this study focused on refining models to predict the fatigue of the operator.
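The split-train-validate loop described here maps directly onto scikit-learn; in the sketch below, synthetic arrays stand in for the assembled feature table and normalized target, and the hyperparameters follow those reported in Section 3.3.4:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4140, 30))        # stand-in for the shift-based feature table
y = rng.gamma(2.0, 0.003, size=4140)   # stand-in for normalized fatigue per shift

# 80% training / 20% validation split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=0)

rf = RandomForestRegressor(n_estimators=1000, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, pred))
print(f"RMSE={rmse:.4f}, R2={r2_score(y_val, pred):.3f}")
```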

The independent variables (features) in these models were minimally engineered: possible sample counts, means, sums, minimums and maximums were used without combining multiple fields from the underlying tables. The goal of these models was to predict fatigue while including as many of the available feature sets as possible, such as the hour of the day, shift, month of the year, ambient temperature, wind speed, precipitation, etc. Data for these models were constrained to the number of days contained in the fatigue data, which provided the dependent variable. Thus, the models were created using data from 1 January 2014 to 9 August 2017.

#### 3.3.3. Evaluating Model Performance

One way to evaluate the model performance is the out-of-bag (*OOB*) error. The out-of-bag set includes the data not chosen in the sampling process when initially building a random forest. The *OOB* error is the average error of the predictions calculated from the trees that did not contain the respective sample. Here, we used the random forest Python package, which can generate two optional information values: a value for the importance of the predictor variables (feature importance) and a value for the internal structure of the data (the proximity of different data points to one another) [30].
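The closest scikit-learn analogue of the two outputs described here is sketched below, reusing `X_train`/`y_train` from the previous sketch; the package used by the authors may expose these values differently, and scikit-learn does not report data-point proximities directly:

```python
from sklearn.ensemble import RandomForestRegressor

rf_oob = RandomForestRegressor(n_estimators=1000, max_features="sqrt",
                               oob_score=True, random_state=0)
rf_oob.fit(X_train, y_train)

# OOB score: each sample is predicted only by the trees whose bootstrap
# sample did not contain it (reported as an R^2 value for regressors).
print("OOB R^2:", rf_oob.oob_score_)

# Relative importance of each predictor variable.
print(rf_oob.feature_importances_.round(4))
```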

Next, the performance of the model was evaluated using the root mean squared error (*RMSE*) and the coefficient of determination (*R*<sup>2</sup>). The coefficient of determination is the best method to compare models that are trained using different dependent variables. Both *RMSE* and the coefficient of determination are important means of measuring performance between models trained to predict the same dependent variable. The reason that *R*<sup>2</sup> should be used when comparing models trained on different dependent variables is that the coefficient of determination is normalized to the mean of the dependent variable for each model.
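For reference, with $y_i$ the observed value, $\hat{y}_i$ the prediction and $\bar{y}$ the mean of the dependent variable, the two metrics are defined as:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}, \qquad R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$

The denominator of *R*<sup>2</sup> is what normalizes the score to the variability of each model's own dependent variable, which is why it is the appropriate metric for comparing models trained on different targets.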

#### 3.3.4. Model Generalization

When creating machine learning models, it is important to ensure that the predictions are generalizable to data that the model was not trained on. A model that has a very low training error but a very high validation error is considered not to generalize well. This scenario is known as overfitting [32]. The most common method to ensure that a

model has not been overfit is splitting data into training and validation sets. The model learns its parameters from the training data set. The performance of the trained model is then determined by how well it predicts the outcomes of the validation data set. The hyperparameters of the model can then be tuned by the developer, and the model is retrained to improve its performance against the validation data set. Hyperparameters are the values that define the model and cannot be learned from data; they are set by the developer of the machine learning algorithm (number of estimators, max number of features, etc.). The number of estimators and the max number of features for the best model here are 1000 and sqrt (number of features), respectively. For each model here, the data sets were split into training and validation sets. In this study, due to a lack of sufficient data for a double hold-out (test set), there is only a validation set.

#### 3.3.5. Feature Importance

Feature importance is the process of ranking the individual elements of a machine learning model according to their relative importance to the accuracy of that model [33,34]. It is a means of determining which features have the greatest magnitude of effect in a model: features with a high importance value have a greater impact on the model's predictions. These scores can be used to better understand the data and the model and to reduce the number of input features. The relative scores can highlight which features are most useful for predicting fatigue and, conversely, which features are the least helpful; this may be used as the basis for gathering more or different data, and it shows whether the model has been fit to the most important features. In addition, feature importance can be used to improve a predictive model by eliminating the features with the lowest scores or retaining those with the highest scores, which helps to select features and speeds up the modeling process.

#### **4. Results**

Differences between the two models were found after the analyses: according to their *R* <sup>2</sup> and *RMSE* values, the hourly-based model does not perform as well as the shift-based model.

Table 4 displays the results of the best-performing model, which predicted fatigue events across the site with an *R* <sup>2</sup> value of 0.36 and an *RMSE* value of 0.006. All other models were deemed to have values too low to warrant further exploration using this feature set. The best model used the shift-based data to predict fatigue. Below, we discuss feature importance and drop-column tools to examine the feature set of the shift-based model.


**Table 4.** Refined model performance results.

#### *4.1. Feature Importance of Best-Performing Model*

Generally, feature importance provides a score that identifies the value of each feature in creating the random forest model. Features that have a greater effect on key decisions have higher relative importance. Table 5 shows the most important features and their values for the fatigue event prediction model (shift-based model) with the best performance.


**Table 5.** Permutation importance of features for shift-based model (most important features).

The shift type (day/night shift) variable has the strongest effect on the model, followed by the amount of unscheduled equipment downtime across the whole mine for a shift. "Unscheduled downtime" refers to a piece of equipment going down for maintenance in an unplanned situation. Other factors that affect fatigue are production variables. These outcomes corroborate the initial data analyses regarding the effects of day shifts and night shifts on the fatigue of workers. They also demonstrate that production and equipment alarm variables, such as equipment downtime, can aid in predicting the occurrence of fatigue events. Moreover, weather variables such as maximum and average temperature can increase the rate of fatigue events among workers.

#### *4.2. Drop-Column Feature Importance*

With large data sets, there is always a risk of having variables that are covariates or have co-dependences. Co-dependencies stem from the fact that the trees are not independent, since they are sampled from the same data in the process of building the RF model. Random forest tools recognize this risk and include mechanisms to address it, which can assess the individual effect of each feature on the model. It is important to see how the model works without individual features and how each feature impacts the model, whether positively or negatively. Instead of carrying out different iterations manually, random forest workflows include a built-in tool that runs models with fewer features and tracks the models' performance. This is achieved by dropping each column (or feature) from the model, retraining the entire model and then comparing the score with the base score. Negative values show features that improve the model when removed; positive values show features that weaken the model when removed. Values close to zero tend to indicate features that are correlated with other features; removing them makes little difference to the model's ability to find relationships using the correlated variables. In Table 6, the ten most and ten least important features are displayed.


**Table 6.** Drop-column importance for the best model.

As shown in Table 6, shift type has the strongest effect on the model, followed by some production and alarm variables, as indicated by their feature importance scores. This score shows that dropping, for example, shift type from the features causes the performance of the model (*R* <sup>2</sup>) to drastically decrease by 0.2922. On the other hand, eliminating the mine load capacity percentage increases the performance of the model by 0.0325.
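A sketch of the drop-column procedure described above, assuming scikit-learn and placeholder data (the sign convention matches the text: positive values mean performance drops when the feature is removed):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 2.0 * X["f0"] + rng.normal(size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=3)

def validation_r2(columns):
    model = RandomForestRegressor(n_estimators=200, random_state=3)
    model.fit(X_tr[columns], y_tr)
    return r2_score(y_va, model.predict(X_va[columns]))

base = validation_r2(list(X.columns))
for col in X.columns:
    kept = [c for c in X.columns if c != col]
    delta = base - validation_r2(kept)
    # Positive delta: the model gets worse without this feature.
    print(f"{col}: {delta:+.4f}")
```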

#### *4.3. ICE Plot*

Another tool to visualize how marginal changes in features affect the predictions of the model is the individual conditional expectation (ICE) plot. An ICE plot identifies the dependence of the prediction on a feature for each instance independently, generating one line per instance, in contrast to the single overall line of a partial dependence plot; the partial dependence plot (PDP) is the average of the lines of an ICE plot. The value of a line or model score is compared while all other features are kept the same. The result is a set of points for an instance, with a feature value from the grid and the respective predictions [35]. ICE plots for the three top features are displayed in Figure 9, and ICE plots of the ten top features are provided in Appendix A. They show how the model's predictions change depending on marginal changes to the top features. For instance, they illustrate that the model's prediction difference for shift type decreases from the day shift to the night shift. The ICE plot of unscheduled downtime count shows that the prediction difference of the model for a small amount of unscheduled equipment downtime is not high, but it starts to increase after 40 counts of unscheduled downtime. Moreover, the prediction difference of the model for mine measured production is small, but it increases after 300,000 tons of production. Therefore, looking at the marginal changes in the top features offers insights into how they affect the model prediction.

**Figure 9.** ICE plots of the top three features (blue lines identify the dependence of the prediction on a feature for each instance independently, and the yellow line is the PDP, which shows the average of the blue lines).
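Plots of this kind can be produced with scikit-learn's partial dependence utilities; a minimal sketch with placeholder data (the feature index is illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] > 0, 1.0, 0.2) + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=200, random_state=4).fit(X, y)

# kind="both" overlays the per-instance ICE lines and their average (PDP).
PartialDependenceDisplay.from_estimator(rf, X, features=[0], kind="both")
plt.show()
```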

#### *4.4. Comparison*

Table 7 shows the top features from the two generated models. Of the models that were run, the best performance was demonstrated by the shift-based fatigue model, which is used to predict fatigue events based on shift data. This model achieved an *R* <sup>2</sup> value of 0.36, which is reasonably high for the prediction of outcomes that are the result of very complex interactions. Fatigue is a complex issue and can occur for different psychological and physiological reasons; therefore, it is difficult to predict with high accuracy.



Another model, which is based on the hourly aggregated fatigue occurrence, confirms that the time of day helps to predict fatigue, as expected, since this is one of the top features in the hourly-based model. Moreover, ambient temperature has a notable effect on fatigue, as evidenced by the hourly-based model; however, temperature is clearly linked to the time of day. More work is needed to assess the potential role of air conditioning in this. In addition to time and weather factors, some production and equipment health alarm variables have effects on the fatigue of haul truck drivers, as the hourly-based model shows (see Table 7).

#### **5. Discussion**

The model output identifies the variables that have the greatest impact on all fatigue events. Table 8 illustrates the most important features and their data sources. The results confirm our existing understanding of fatigue and offer some interesting insights into additional factors that potentially cause fatigue. While it is not surprising that shift type causes fatigue, it is interesting that maintenance processes such as unscheduled downtime and production rates, as well as other operational variables, can affect fatigue among haul truck drivers. Having identified these additional predictors for fatigue, these indicators can be used by managers to prioritize safety management efforts. The ICE plots show how marginal changes to specific variables affect the model. Therefore, they can be potentially used as thresholds for KPIs. For example, if the mine is approaching a value of 40 for unscheduled downtime, a higher risk of fatigue is indicated.

**Table 8.** Top features by data classification.


In many fields of science, it is difficult to obtain models that achieve *R* <sup>2</sup> values of high magnitude. Since fatigue is a complex issue, finding a comprehensive model with a high *R* <sup>2</sup> is challenging. However, the methodology and future iterations could provide beneficial insights. The finding that 36% of fatigue events can be explained by shift type, weather and operational data indicates that 64% of the variance can be attributed to factors that we are not currently modeling. Therefore, the next step in fatigue modeling would be exploring additional contributors to operator fatigue. In this study, the mine's data have been aggregated according to shift or hour, but future models could examine fatigue in a more individualistic way. Deeper integration of the data sets for individual operators could be one way of accomplishing this. Additional factors such as an individual's habits and sleep patterns could also provide another level to the model and would give a more detailed view of the fatigue of the workers.

From the perspective of health and safety management, the most important features found in this study can be considered potential leading indicators for reducing fatigue. The surprising finding regarding unscheduled equipment downtime events is an aspect that needs to be explored further. The impact of process disruption on fatigue was one finding that was consistent with the study by Drews et al. (2020) [2]. More research from a health and safety perspective is needed to understand why some of the alarm and production variables of different fleets have a greater effect on fatigue; however, fitness for duty could be one reason behind the different fatigue events for different fleets. Mining companies can use these indicators to anticipate increases in fatigue and to potentially mitigate it. These model outcomes can be utilized to implement health and safety policies, training programs and mitigation practices. If mine operations can identify the times and shift types that are more susceptible to fatigue, specific strategies could be implemented, such as mandatory break times for the operators and supervisory support during this time. Management can also train the operators to be more alert at specific times of the day and during specific shifts, and to be more aware of how fitness can decrease fatigue. The models' output shows that ambient temperature also has significant effects on the fatigue of haul truck drivers. This must also be studied further to understand the degree to which this factor influences specific individuals' fatigue states.

Moreover, the hourly-based model results provide an understanding of the effects of the variables that impact fatigue for health and safety management. It demonstrates that a leading indicator to predict fatigue is the time of day. Therefore, special attention and planning is required for those times with a higher risk of fatigue. All of these outcomes can be considered when prioritizing tasks by health and safety management.

#### **6. Limitations and Future Work**

This study shows the application of machine learning in health and safety management using operational data sets from mining operations. The findings confirm that fatigue is caused by a wide variety of factors, many of which are likely very difficult to quantify, but there may be a small but impactful subset of factors that can be quantified. Fatigue prediction is a matter of predicting the complex interactions between human behavior and the ever-changing work environments at mines. In the social sciences, it is very common to see situations where a low *R* <sup>2</sup> value captures relationships that quantify a relatively high amount of variance in a complex relationship [36]. Individual worker data can be added to the model to increase the accuracy of the prediction model, since only operational data and weather data are utilized in these models.

In all of the models developed, the training scores are substantially better than the validation scores. This is most often attributable to overfitting of the model, but in this case, it is likely largely due to the difficulty of generalizing a model that can predict fatigue, given the complex psychological and physiological factors associated with fatigue. This line of research will become more important as the fitness for duty of equipment operators takes on greater significance in scheduling operator work shifts.

Even when using a fairly simple model with a small data set, the best-performing model in this study is able to achieve excellent results. Many refinements were made to the models during this study, but there are many avenues of exploration that could yield even stronger predictive models. Some key areas to explore in future models could include:

• Looking at individual fatigue events instead of the aggregated fatigue events;

• Creating common naming conventions between data sets so that they can be linked by location, operator and equipment;

• Adding more complex features such as the sleep pattern, health condition, fitness or diet of the operator;

• Adding some features related to the working schedule of the operator in terms of fatigue at the time and the day or week before;

• Adding features that represent information collected during time periods prior to when the fatigue occurred, such as downtime or production on the previous day;

• Exploring more details of each feature to reduce the number of features that have a lower impact on fatigue.

**Author Contributions:** Conceptualization, E.T. and W.P.R.; methodology, E.T., W.P.R. and T.M.; software, E.T. and T.M.; validation, W.P.R., E.T. and F.A.D.; formal analysis, E.T.; investigation, E.T.; resources, W.P.R.; data curation, E.T. and T.M.; writing—original draft preparation, E.T.; writing—review and editing, E.T., W.P.R. and F.A.D.; visualization, E.T.; supervision, W.P.R.; project administration, W.P.R.; funding acquisition, W.P.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** Funding for the project was provided by the National Institute for Occupational Safety and Health (NIOSH) with grant number 75D30119C05500.

**Data Availability Statement:** The data presented in this study are not publicly available due to confidentiality.

**Acknowledgments:** We would like to express our gratitude to the National Institute for Occupational Safety and Health (NIOSH) for the funding, as well as the mining company, its management and all of the people supporting this research.

**Conflicts of Interest:** On behalf of all authors, the corresponding author states that there is no conflict of interest.

#### **Appendix A**

The ICE plots of the top ten features from the shift-based model are presented in Figure A1.

**Figure A1.** ICE plots of the top ten features (blue lines identify the dependence of the prediction on a feature for each instance independently, and the yellow line is the PDP, which shows the average of the blue lines).

#### **References**

1. Bauerle, T.; Dugdale, Z.; Poplin, G. Mineworker fatigue: A review of what we know and future directions. *Min. Eng.* **2018**, *70*, 1–19.
2. Drews, F.A.; Rogers, W.P.; Talebi, E.; Lee, S. The Experience and Management of Fatigue: A Study of Mine Haulage Operations. *Min. Metall. Explor.* **2020**, *37*, 1837–1846, doi:10.1007/s42461-020-00259-w.
3. … fatigue risk management system for the road transport industry. *Sleep Med. Rev.* **2014**, *18*, 141–152, doi:10.1016/j.smrv.2013.03.003.
4. Briggs, C.; Nolan, J.; Heiler, K. *Fitness for Duty in the Australian Mining Industry: Emerging Legal and Industrial Issues*; ISBN 186487421X; Australian Centre for Industrial Relations Research and Teaching: Sydney, NSW, Australia, 2001; pp. 1–52.
5. Parker, T.W.; Warringham, C. Fitness for work in mining: Not a 'one size fits all' approach. In Proceedings of the Queensland Mining Industry Health & Safety Conference, Brookfield, QLD, Australia, 4–7 August 2004.
6. Hutchinson, B. "Fatigue Management in Mining—Time to Wake up and Act", Optimize Consulting. 2014. Available online: http://www.tmsconsulting.com.au/basic-fatigue-management-in-mining/ (accessed on 25 January 2020).
7. Pelders, J.; Nelson, G. Contributors to Fatigue of Mine Workers in the South African Gold and Platinum Sector. *Saf. Health Work* **2019**, *10*, 188–195, doi:10.1016/j.shaw.2018.12.002.
8. Cavuoto, L.; Megahed, F. Understanding fatigue and the implications for worker safety. In Proceedings of the ASSE Professional Development Conference and Exposition, Atlanta, GA, USA, 26 June 2016.

*Article* **Effectiveness of Natural Language Processing Based Machine Learning in Analyzing Incident Narratives at a Mine**

**Rajive Ganguli <sup>1,\*</sup>, Preston Miller <sup>2</sup> and Rambabu Pothina <sup>1</sup>**


**Abstract:** To achieve the goal of preventing serious injuries and fatalities, it is important for a mine site to analyze site specific mine safety data. The advances in natural language processing (NLP) create an opportunity to develop machine learning (ML) tools to automate analysis of mine health and safety management systems (HSMS) data without requiring experts at every mine site. As a demonstration, nine random forest (RF) models were developed to classify narratives from the Mine Safety and Health Administration (MSHA) database into nine accident types. MSHA accident categories are quite descriptive and are, thus, a proxy for high level understanding of the incidents. A single model developed to classify narratives into a single category was more effective than a single model that classified narratives into different categories. The developed models were then applied to narratives taken from a mine HSMS (non-MSHA), to classify them into MSHA accident categories. About two thirds of the non-MSHA narratives were automatically classified by the RF models. The automatically classified narratives were then evaluated manually. The evaluation showed an accuracy of 96% for automated classifications. The near perfect classification of non-MSHA narratives by MSHA based machine learning models demonstrates that NLP can be a powerful tool to analyze HSMS data.

**Keywords:** mine safety and health; accidents; narratives; machine learning; natural language processing; random forest classification

#### **1. Introduction**

Workers' health and safety is of utmost priority for the sustainability of any industry. Unfortunately, occupational accidents are still reported in high numbers globally. According to recent estimates published by the International Labour Organization (ILO), 2.78 million workers die from occupational accidents and diseases worldwide each year [1]. In addition, 374 million workers suffer from non-fatal accidents, and lost work days represent approximately 4% of the world's gross domestic product [2,3]. It is, therefore, not surprising that researchers are constantly investigating factors that impact safety [4,5], or finding innovations and technology to improve safety [6,7].

For the U.S. mining industry, for the years 2016–2019, the National Institute for Occupational Safety and Health (NIOSH), a division of the US Centers for Disease Control and Prevention (CDC), reports 105 fatal accidents and 15,803 non-fatal lost-time injuries [8]. To bring down the rate of serious injuries and fatalities, the industry analyzes incident reports to conduct root cause analysis and identify leading indicators. Unfortunately, as noted by the International Council on Mining and Metals, a global organization of some of the largest mining companies of the world, the vast trove of incident data is not analyzed as much as it could be due to a lack of analytics expertise at mine sites [9]. With the advances in natural language processing (NLP), there is now an opportunity to create NLP-based tools to process and analyze such textual data without requiring human experts at the mine site.

**Citation:** Ganguli, R.; Miller, P.; Pothina, R. Effectiveness of Natural Language Processing Based Machine Learning in Analyzing Incident Narratives at a Mine. *Minerals* **2021**, *11*, 776. https://doi.org/10.3390/ min11070776

Academic Editor: Yosoon Choi

Received: 22 May 2021 Accepted: 14 July 2021 Published: 17 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Natural language processing (NLP) has been explored as a tool to analyze safety reports since the 1990s [10,11]. This paper, intended for a mining industry audience, presents in this section a brief history of NLP and its use in analyzing safety reports. NLP is the automated ability to extract useful information from the written or spoken words of a language. Exploring its application to safety is logical, as safety reports contain valuable information. If causation and associated details can be automatically extracted from safety reports, NLP can be used to quickly gain insight into safety incidents from historical reports that are filed away in safety management databases. Additionally, with smartphone-based work site observation apps becoming popular, NLP tools can be useful in providing real-time insights as incidents and observations are reported. For example, in a confidential project, one of the authors of this paper advised an industrial site about a hazardous practice at the operation using an NLP analysis of data collected using a smartphone-based application. This hazard became apparent after evaluating the data because several employees had noted the practice in their worksite observations.

The efforts to apply NLP to extract causation from safety reports received a major boost when the Pacific Northwest National Laboratory (PNNL) put together a large team in the early 2000s to apply NLP and analyze aviation safety reports from the National Aeronautics and Space Administration's (NASA) aviation safety program [12]. The "meaning" of a sentence depends not just on the words, but also on the context. Therefore, PNNL used a variety of human experts to develop algorithms to extract human performance factors (HPF) from report narratives. HPF definitions were adopted from NASA [13]. The PNNL approach consisted of artificial intelligence (AI) applied after the text was preprocessed using linguistic rules. The linguistic rules, developed by human experts, considered specific phrases and sentence structures common in aviation reports. When automated, these rules were able to identify causes of safety incidents on par with human experts. The PNNL team, however, noted the reliance of the algorithms on human experts with domain-specific knowledge.

New developments have reduced human involvement in text analysis [14]. These developments include identifying linguistic features such as parts of speech, word dependencies, and lemmas. A million-sentence database (or "corpus", to use NLP terminology) may only contain 50,000 unique words once words such as 'buy' and 'bought' (one is a lemma of the other) are compressed into one; though that is also a choice for the human expert. After vectorization, each sentence in the database is a vector of length 50,000, with most elements being zero (a twelve-word sentence will only have ones in twelve places). When the relative order of words in a sentence is taken into account, common phrases can be identified easily. Thus, after preprocessing with NLP techniques, classical statistics and machine learning techniques can be applied to classify text. Baker et al., 2020 [15] used a variety of NLP and machine learning techniques to classify incident reports and predict safety outcomes in the construction industry. Tixier et al., 2016 developed a rule-based NLP algorithm that depends on a library of accident-related keywords to extract precursors and outcomes from unstructured injury reports in the construction industry [16]. In another study, conducted on narratives from the Aviation Safety Reporting System (ASRS), NLP-based text preprocessing techniques along with k-means clustering classification were used to identify various safety events of interest [17]. Baillargeon et al., 2021 [18] used NLP and machine learning techniques to extract features of importance to the insurance industry from public domain highway accident data. In an analysis conducted on the infraction history of certain mine categories, ML-based classification and regression tree (CART) and random forest (RF) models were used on Mine Safety and Health Administration (MSHA) database narratives to predict the likely occurrence of serious injuries in the near future (the following 12-month period) [19].

The application of NLP-based machine learning to mining industry safety data is relatively new. Yedla et al., 2020 [20] used the public domain (MSHA) database to test the utility of narratives in predicting accident attributes. They found that vectorized forms of narratives could improve the predictability of factors such as days away from work. Other researchers used NLP to analyze fatality reports in the MSHA database [21]. Using co-occurrence matrices for key phrases, they were able to identify some of the common causes of accidents for specific equipment.

#### **2. Importance of this Paper**

In safety-related research, it is typical to demonstrate NLP and machine learning capabilities on public domain databases. Models are first developed on a public domain database, after which its capabilities are demonstrated on an independent subset of the same database. Since modeling and subsequent demonstration of model capabilities happen on the same dataset, there is no certainty that these approaches or models would be effective on databases created by other sources. For example, every entry in an MSHA database is made by a federal employee. Would a federal employee describe an incident the same way as a mining company employee? If yes, then there exists a specific language for mine safety that is shared by safety professionals. This 'language', if it exists, can be leveraged to make NLP-based machine learning of mine safety data very effective.

This paper advances the use and application of NLP to analyze mine safety incident reports by demonstrating that machine learning models developed on public domain mine safety databases can be applied effectively on private sector safety datasets. Therefore, it demonstrates that there is a language of safety that spans organizations. Furthermore, this paper identifies key attributes of specific categories of incidents. This knowledge can be used to improve algorithms and/or understand their performance.

More generally, the paper advances the field of mine safety research. Currently, data-mining-based mine safety researchers focus only on categorical or numerical data. Therefore, gained insights are limited to statistical characterization of data (such as average age, or work experience) or models based on these data [4]. If narratives are available with incident data (as they often are), this paper will encourage researchers to evaluate them to glean more insights into the underlying causes.

#### **3. Research Methodology**

#### *3.1. MSHA Accident Database*

The MSHA accident database [22] has 57 fields used to describe safety incidents, including meta-data (mine identification, date of incident, etc.), a narrative description of the incident, and various attributes of the incidents. Some of the data are categorical, such as body part injured and accident type. More than eighty-one thousand (81,298) records spanning the years 2011 to early 2021 were used in this research. Any operating mine in the United States that had a reportable injury is in the database. Thus, the database reflects many types of mines, jobs, and accidents.

Accidents are classified in the database as belonging to one of 45 accident types. Examples include "Absorption of radiations, caustics, toxic and noxious substances", "Caught in, under or between a moving and a stationary object", and "Over-exertion in wielding or throwing objects". Looking at these definitions, it appears that MSHA defined them to almost answer the question "What happened?" Thus, the category is simply the high level human summary of the narrative, i.e., the category is the "meaning" of the narrative. In this paper, the MSHA accident type is considered a proxy for the meaning of the narrative. Narratives are typically five sentences or less.

#### *3.2. Random Forest Classifier*

The random forest (RF) technique was used to classify the narratives based on accident types. Random forests are simply a group of decision trees. Though described here briefly, those unfamiliar with decision trees are referred to Mitchell, 1997 [23], a good textbook on the topic and the source for the description below. A decision tree is essentially a series of yes or no questions applied to a particular column ("feature") of the input data. The decision from the question (for example, miner experience > 10, where miner experience is a feature in the data set) segments the data. Each question is, thus, a "boundary" splitting the data into two subsets of different sizes. The segmented data may be further segmented by applying another boundary, though the next boundary may be on another feature. Applying several boundaries one after the other results in numerous small subsets of data, with data between boundaries ideally belonging to a single category. The maximum number of decisions applied along the longest pathway is called the "tree depth". The method works by applying the sequence of boundaries to a sample, with the final boundary determining its class. Note that while one boundary (also called a "node") makes the final decision on the class for one sample, some other boundary may make the decision for another sample. It all depends on the path taken by a particular sample as it travels through the tree. When the final boundary does not result in a unanimous class, the most popular class in the subset is used as the final decision of the class.

Boundaries are set to minimize the error on either side of the boundaries. The combination of a given data set and given boundary criteria will always result in a specific tree. In an RF, a decision tree is formed by randomly selecting (with replacement) the data. Thus, while a traditional decision tree will use the entire modeling subset for forming the tree, a decision tree in an RF will use the same amount of data, but with some samples occurring multiple times, and some not occurring at all. Thus, the same data set can yield multiple trees. In the RF technique, multiple trees formed with a random selection of data are used to classify the data. One can then use any method of choice to combine predictions from the different trees. This method of using a group of trees is superior to using a single decision tree.

In this paper, an RF classifier was applied to model the relationship between a narrative and its accident type. A non-MSHA database would contain narratives, but not any of the other fields populated by MSHA staff. Since the goal of the project is to test the models on non-MSHA data, no other field in the database was used to strengthen the model. Half of the records were randomly selected to develop the model, which was then tested on the remaining half of the records to evaluate its performance on the MSHA data. In the final step, the model was tested on non-MSHA data. There is no standard for what proportion of data to use for the training and testing subsets, though it is expected that the subsets be similar [24]; a 50–50 split is a common practice [25,26]. RF models were developed using the function RandomForestClassifier() in the SCIKIT-LEARN [27] toolkit. As is common practice in machine learning [28], the authors did not code the RF but used a popular tool instead.

Modeling starts by making a list of non-trivial words in the narratives. As is typical in NLP, the narratives were pre-processed before the list of non-trivial words was made. Pre-processing consisted of:


The combined length of all narratives was 1.72 million words, consisting of 31,995 unique words or "features". The list of unique features is called the vocabulary. The input data set is then prepared by selecting the top 300 most frequently occurring words ("max features"). Essentially, the vocabulary is cut from its full length to just the words occurring most frequently. These words are used to vectorize each narrative such that each narrative is represented as a vector of size 300. The value at a given location in the vector would represent the number of occurrences of that word in that narrative. The top 5 words were: fall, right, left, back, and cause.

The output for each narrative consisted of a 1 or a 0, indicating whether it belonged ("1") to a particular category of accident or not ("0"). "Max features" is a parameter in RF modeling, and was set to 300 after trial-and-error exercises. Similarly, the number of trees ("n\_estimators") was set to 100. Another parameter is "max\_depth" (maximum depth of tree). This parameter was not set; whenever a parameter is not specified, the tool uses default values. In the default setting for tree depth, data are continually segmented until the final group is all from the same class. According to the user guide of the tool, the main parameters are the number of trees and max features. The rest of the parameters were not set, i.e., default values were used. The interested reader can visit the links provided in the footnotes for technical details about the toolkits, including the default values. The tool combines the outputs of the various trees by averaging them to obtain the final classification.
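A minimal sketch of this vectorize-and-classify setup (the 300 max features, 100 trees and 50–50 split come from the text; the toy narratives and labels are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy narratives and binary labels (1 = belongs to the accident type).
narratives = [
    "employee fell on walkway and injured left knee",
    "miner strained back while lifting a pump",
    "rock fell from rib and struck hard hat",
    "operator caught finger between belt and roller",
    "worker slipped on ice and fell to the ground",
    "driver struck by flying object while drilling",
]
labels = [1, 0, 0, 0, 1, 0]

# Each narrative becomes a vector of word counts over the most frequent
# words in the corpus (300 in the paper; far fewer in this toy corpus).
vectorizer = CountVectorizer(max_features=300)
X = vectorizer.fit_transform(narratives)

# 50-50 split between training and testing subsets, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```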

Among the 45 accident types are some whose names start with the same phrase. For example, there are four over-exertion (OE) types, all of which start with the phrase over-exertion. They are (verbatim): Over-exertion in lifting objects, over-exertion in pulling or pushing objects, over-exertion in wielding or throwing objects, and over-exertion NEC. Accident categories whose names begin with the same phrase are considered to belong to the same "type group", with the phrase defining the grouping.

NEC stands for "not elsewhere classified," and is used within some type groups. When it exists, it is often the largest sub-group, as it covers everything that is not easily defined. There are 11 types that start with "Fall", including two that start with "Fall to". Five types start with "Caught in". Six start with "Struck by". These accident type groups contain 26 of the 45 accident types, but 86% of all incidents (35,170 of the 40,649 records in each data subset). Table 1 shows the four type groups that were modeled in this paper. Separate models were developed for some of the sub-groups to get an understanding of these narrowly defined accidents. These were:

• Over-exertion in lifting objects (OEL);
• Over-exertion in pulling or pushing objects (OEP);
• Fall to the walkway or working surface (FWW);
• "Caught in", under or between a moving and a stationary object (CIMS);
• Struck by flying object (SFO).



**Table 1.** The four type groups of accidents modeled in the paper.

Thus, a total of nine RF models were developed; four for the four type groups, and five for the specific types. Table 2 shows the characterization of the training and testing subsets that went into developing the models. It is apparent that each category was represented about the same in the two subsets.


**Table 2.** Various accident categories in the training and testing subsets. Each subset has 40,649 samples.

In classification exercises, it is common to develop a single model to classify a data set into multiple categories, rather than develop models for each category individually. The reason for developing nine models instead of one is discussed in the next section.

#### **4. Results**

#### *4.1. Performance within MSHA Data*

Table 3 shows a summary of the modeling within the MSHA test set. To understand the table, consider the OE type group. Of the 40,649 records in the test set, 8979 records were from this type. The success of an RF model can be determined by identifying the OE type as OE type and/or by classifying a non-OE type (31,670 records) as not belonging to OE. This is shown below through a simple computation.


**Table 3.** Results of RF models in the MSHA test set.


The overall success was 92%, i.e., a very high proportion of narratives were classified correctly as belonging to the OE type group or as not belonging to it. Though it is an indicator of overall success, this type of evaluation is not particularly useful, as classifying a narrative as "not belonging to OE" is not helpful to the user. It is more useful to look at how successful the RFs were in correctly identifying narratives from the accident type in question (the OE type group in this example). As shown in the table and in the example computation, 81% of these 8979 records (7248) were accurately identified. The false positive rate was 4%, i.e., 1331 of the 31,670 non-OE records were identified as OE. The low false positive rate implies that if a narrative was classified as belonging to the OE type group, it was highly likely to belong to that type. The success in the other type groups was lower, ranging from 71% to 75%, with false positives ranging from 1% to 5%. Thus, one could expect RF to accurately identify about 75% of the narratives in the MSHA database from the four type groups, with a good false positive rate.
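The 92% figure is consistent with the following computation over the 40,649 test records (true positives plus true negatives, using the counts reported here):

```latex
\text{overall success} = \frac{7248 + (31{,}670 - 1331)}{40{,}649}
                       = \frac{37{,}587}{40{,}649} \approx 92\%
```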

The success rate takes a dramatic downturn with the individual models. Only 25% to 59% of narratives belonging to the individual types are correctly classified, though with a negligible false positive rate. The negligible false positive rate implies that when the model classifies a narrative as belonging to a specific category, it is almost guaranteed to be in that category. The low number of records in the individual categories is one part of the explanation for the poor performance, as models are less powerful when they are trained on fewer records. For example, only about 3% of the records were from the OEP category. This means that 97% of the data seen by the OEP model was not relevant to identifying OEP. An additional explanation is obtained from trigram analysis of the narratives that belong to these accident types. Trigrams are the sets of three words that occur consecutively most often. Trigram analysis was conducted using the NLTK collocations toolkit.
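A sketch of trigram frequency analysis with the NLTK collocations toolkit (toy narratives stand in for the pre-processed MSHA narratives):

```python
from nltk.collocations import TrigramCollocationFinder

narratives = [
    "employee strained lower back while lifting bags of cement",
    "miner felt pain in lower back while lifting a motor",
]

# Flatten the corpus into a single word sequence (simple whitespace
# tokenization here; the paper pre-processed the narratives first).
words = [w for n in narratives for w in n.lower().split()]

finder = TrigramCollocationFinder.from_words(words)
# Most frequent three-word sequences, as reported in Tables 4-6.
for trigram, count in finder.ngram_fd.most_common(5):
    print(" ".join(trigram), count)
```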

Table 4 shows the tri-word sequences that occur most frequently in the OE accident types, listed in order of frequency. The overlap between the tri-words is immediately apparent. Back, shoulders, knee, abdomen, and groin are injured most in these types of accidents. The overlap between OEP and OEL would cause accidents to be misclassified as belonging to the other category. This issue is also evident in the Fall accident types (Table 5), where losing balance, slipping, and falling seem to be the major attributes. Even the two types "Caught in" and "Struck by" have some overlap (Table 6). "Caught in" makes it apparent that it is the fingers that are predominantly injured in this type of accident. SFO highlights that eyes and safety glasses are impacted when someone is struck by a flying object.


**Table 4.** Results of trigram analysis on OE accident types.

**Table 5.** Results of trigram analysis on Fall accident types.



**Table 6.** Results of trigram analysis on Caught in and Struck by accident types.

The success rate for classification was dramatically lower when a single RF model was developed to classify the narratives into the separate categories. OEP, OEL, FWW, CIMS and SFO had success rates of only 23%, 33%, 19%, 29%, and 17%, respectively, compared to 37%, 59%, 34%, 55%, and 25% with individual models. Multiple models for multiple categories require that multiple models be applied to the same data, resulting in multiple predictions of category. It would then be possible for a particular narrative to be categorized differently by the different models. In such situations, one could determine the similarity between the narrative and the narratives from the multiple categories in the training set to resolve the conflicting classifications. The features (words) of the category within the training set are the foundation of the model for that category. For example, the words in the "Struck by" category in the training set play a key role in which RF trees are formed in the "Struck by" model. Thus, when a test narrative is classified as "Struck by" by one model, and "Caught in" by another, one could find the similarity between the words in the test narrative and the words in the two categories of the training data, "Struck by" and "Caught in", to resolve the conflict. This is demonstrated in the next section.

#### *4.2. Performance on Non-MSHA Data*

The nine RF models were applied to data from a surface metallic mine in the United States that partnered in this project. The data consisted of narratives that described various safety incidents, with injury severity ranging from very minor incidents to lost-time accidents. The narratives were typically longer than MSHA narratives (about twice the length), sometimes used different formats (such as a bulleted list), and usually contained more details about the incident. The narratives were written by a staff member from the safety department. Narratives from the 119 unique incidents logged in 2019 and 2020 were analyzed; duplicated narratives in the database were ignored. Each model was applied to the 119 narratives separately.

The RF models classified 76 out of the 119 narratives (Table 7) with a high degree of success. Seventeen narratives were classified by multiple models, but not misclassified (explained later). Forty-three (43) narratives were ignored by all nine models, i.e., they were not classified as belonging to a particular category. The classifications were manually evaluated by the authors to see if they matched the MSHA accident types. In many cases, the MSHA database contained an accident that was not only similar to the narrative being manually evaluated but was also classified into the same accident type as the narrative in question; therefore, the manual validation was easy. A narrative was deemed accurately classified if it was also classified as such by the authors. The 43 narratives that were not classified by any of the nine models could possibly belong to one of the 19 MSHA accident types not modeled in this paper. The overall success rate was 96%.



The OE category is quite broad and, therefore, one would expect some narratives to be wrongly classified as OE. It is thus not surprising that 4 out of the 26 narratives classified as OE did not belong in that category. One narrative involved an employee who had a pre-existing soreness in the wrist; the 'incident' was simply the employee reporting to the clinic. Two incidents involved employees backing into or walking into a wall or object while working. The fourth incident involved chafing of the calves from new boots. Some of these incidents would perhaps also have been classified differently had models been developed for the other accident types.

Table 8 shows examples of some of the narratives and the automated classifications. Examples are shown for the narrowest categories as they would normally be the most challenging to identify. Table 9 shows how the overlapping occurred in the 17 narratives. Three narratives were classified as both Fall and FWW, while seven were categorized as both "Caught in" and CIMS. Since nine models were used in parallel, it was possible for each narrative to be categorized into nine different categories. Yet, no narrative was categorized as belonging to three or more different categories. Except for one, these overlaps should be expected. For example, OEL is a subset of OE. Therefore, a narrative classified as OEL by the OEL model is expected to be also classified as OE by the OE model. The overlap between a type group and one of its sub-type is a confirmation that models are working properly. It is good that there was no overlap between OEL and OEP. The overlap between "Caught in" and "Struck by" was surprising as they are different categories. The narrative that was classified as both "Caught in" and "Struck by" is (verbatim): "while installing a new motor/pump assy. using portable a cherry picker, the cherry picker tipped over and the assembly caught the employee leg and ankle between the piping and the motor assembly." Tools and equipment that tip over and cause injury have been reported in the "Struck by" category in the MSHA database. A limb caught in between two objects is reported in the "Caught in" category in the MSHA database. Thus, the RF models were correct in their classification of the narrative. However, the overlap in classification presents a good opportunity to demonstrate how one could use "similarity scores" to resolve the overlap. The steps of the process, to resolve conflicting classifications of "Caught in" and "Struck by" are:


a top word than "leg" did as a top word. Clearly, "catch" is a bigger determiner of "Caught in" than leg is of "Struck by".

6. The narrative is assigned to the category with the highest similarity score. In this case, the narrative is deemed to be of the category "Caught in".
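The full list of scoring steps is not reproduced above; as a purely hypothetical illustration of an overlap-based similarity score of this flavor (the word lists and scoring rule are assumptions, not the authors' exact method):

```python
# Hypothetical similarity score: fraction of a category's top training
# words that also occur in the narrative. The word lists below are
# illustrative assumptions, not the study's actual top words.
def similarity(narrative_words, top_words):
    return sum(w in narrative_words for w in top_words) / len(top_words)

narrative = set(
    "cherry picker tipped over and caught the employee leg and ankle "
    "between the piping and the motor assembly".split()
)
category_top_words = {
    "Caught in": ["caught", "between", "finger", "hand"],
    "Struck by": ["struck", "object", "leg", "fall"],
}
scores = {cat: similarity(narrative, words)
          for cat, words in category_top_words.items()}
print(scores)
print("Assigned category:", max(scores, key=scores.get))
```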

**Table 8.** Examples from the partner mine HSMS, and the automated classifications. Narratives are shown verbatim, but some text has been deleted (identified by . . . ) to not disclose sensitive information.


**Table 9.** Counts of overlapping accident types.


#### **5. Discussion**

Two thirds of the narratives in the partner database could be successfully classified (96% accuracy) without any human intervention. The narratives that were not automatically classified could belong to categories not modeled in this paper; at this time, they were not manually analyzed to determine their nature. The nearly absent overlap in predictions for distinct accident types is encouraging, as it allows the multiple-model-for-multiple-category approach to work. That is further strengthened by the low false positive rates for the distinct categories, i.e., when a particular model for a distinct category (say, OEP) claims that a narrative belongs to that category, the classification is most likely valid. The similarity score approach is presented to resolve cases where a narrative is classified into multiple categories due to the use of multiple models.

The classifications done in the paper were not an empty computational exercise thanks to how MSHA classified the accidents. An increase in narratives being classified as SFO would tell management that foreign matter was entering the eyes of their employees. This is the same as humans reading the narratives, understanding them, and reaching that conclusion. Thus, in some sense, the RF models picked up what the narratives "meant". The high classification success rate also meant that there were specific ways safety professionals describe incidents and that NLP tools can extract that language.

These tools have excellent applicability in helping the mining industry reach its goal of preventing serious injuries and fatalities. On noting an increase in SFO classifications, management can deploy eye-protection-related interventions. An increase in OEL incidents could result in more training about safe lifting. The safety "department" in most mines means a single person with no mandate or expertise to analyze data. These types of tools can assist mines in analyzing data without human intervention. As mines deploy smartphone-based apps to collect employee reports on worksites, the volume of information will explode. However, these tools will help mines process that data and identify hazards before they become incidents.

The detection rate for the narrowest of categories needs to be improved, and improving it would be the most logical next step for this research. A reason why the NLP tools were not always effective may be how incidents are described in the narratives. A limitation of the approach is that it is dependent on terminology and writing style. For example, "roof bolter" related incidents may not be detected by NLP in narratives when the writer uses the term "pinner" to refer to a bolter (though the diligent NLP developer would notice the frequent occurrence of "pinner" in narratives involving "roof"). "Pinner" is a common term for roof bolters in certain parts of the US. Terminology aside, writing style can vary dramatically depending on the region and the English language abilities of the writer. Considering all of this, the MSHA database may not be a great resource for English-based NLP tools in other parts of the world. Regardless, organizations (or nations) developing their own NLP tools could provide training to standardize the writing of safety narratives, so that the data generated assist automation.

The extremely low false positive rate for the narrowest accident types is a wonderful argument for considering these tools. The overall false positive rate across all accident types is quite low, which is good.

#### **6. Conclusions**

Natural language processing based random forest models were developed to classify narratives in the MSHA database depending on accident types. Nine models were developed. Four of the models, i.e., Over-exertion, Fall, "Caught in" and "Struck by", looked at type groups, i.e., groups of particular accident types. Five models looked at specific accident types within these broad groups. They were: Over-exertion in lifting objects, Over-exertion in pulling or pushing objects, Fall to the walkway or working surface, "Caught in", under or between a moving and a stationary object, and Struck by flying object. All models had high overall success rates (typically 95% or higher) in classification on MSHA data when considering both false positive and false negative rates. The success in detecting an accident type within a narrative was higher for type groups (71–81%) than for individual categories (25–59%). Detection was done with low false positive rates for type groups (1–5%), and extremely low false positive rate (<1%) for individual categories.

When a single model was developed to classify narratives into multiple categories, it did not perform as well as when a separate model was developed for each category. A similarity score based method was developed to resolve situations where a particular narrative may be classified differently according to different models.

When applied to non-MSHA data, the developed models were successful in classifying about two-thirds of the narratives in a non-MSHA database with 96% accuracy. The narratives that are not classified by the models could belong to accident types not modeled in this paper. In classifying the non-MSHA narratives with near perfect accuracy, the paper demonstrates the utility of NLP-based machine learning in mine safety research. It also demonstrates that there exists a language for mine safety, as models developed on narratives written by MSHA personnel apply to narratives written by non-MSHA professionals. They also demonstrate that natural language processing tools can help understand this language automatically.

**Author Contributions:** Conceptualization, R.G.; data curation, P.M. and R.P.; formal analysis, R.G., P.M., and R.P.; funding acquisition, R.G.; investigation, R.G., P.M., and R.P.; methodology, R.G., P.M., and R.P.; validation, R.G., P.M., and R.P.; visualization, R.G. and P.M.; writing—original draft, R.G.; writing—review & editing, P.M. and R.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This project was partially supported by the National Institute of Occupational Safety and Health.

**Data Availability Statement:** Not Applicable.

**Acknowledgments:** This project was partially supported by the National Institute of Occupational Safety and Health. Their support is gratefully acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

