*3.5. Optimal LSTM Architecture*

From the training dataset, sequences {**x**<sub>1</sub>, . . ., **x**<sub>*τ*</sub>} of length *τ* are extracted to train the LSTM model to forecast the operational relative-hardness at the next time step *τ* + 1, at different time supports. The chosen length corresponds to four hours (*τ* = 8).
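The sliding-window extraction of training sequences can be sketched as follows (a minimal illustration; the function and variable names are not from the paper):

```python
import numpy as np

def make_sequences(series, tau=8):
    """Slide a window of length tau over the series; each window is an
    input sequence and the value at the next step is its target.
    (Illustrative sketch; names are not from the paper.)"""
    X, y = [], []
    for t in range(len(series) - tau):
        X.append(series[t:t + tau])   # {x_t, ..., x_{t+tau-1}}
        y.append(series[t + tau])     # value at the next step, t + tau
    return np.array(X), np.array(y)

# e.g. a signal sampled every 30 min: tau = 8 windows span 4 hours
signal = np.arange(20, dtype=float)
X, y = make_sequences(signal, tau=8)
print(X.shape, y.shape)  # (12, 8) (12,)
```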

The external hyper-parameter to be optimized in any LSTM architecture is the number of hidden units, *n<sub>H</sub>*. The optimal number of hidden units was found in a previous work [14] and is used here; it is displayed in Table 3.

The Adam optimizer is used to train the LSTM with hyper-parameters *ϵ* = 1 × 10<sup>−8</sup>, *β*<sub>1</sub> = 0.9 and *β*<sub>2</sub> = 0.999, as recommended by [17].
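As a reference for where these hyper-parameters enter, a single Adam update (Kingma & Ba) can be sketched as follows; *β*<sub>1</sub>, *β*<sub>2</sub> and *ϵ* match the quoted values, while the learning rate `lr` and the function name are assumptions, not from the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update. beta1, beta2 and eps match the values
    quoted in the text; lr is an assumed learning rate."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected moments (t >= 1)
    v_hat = v / (1 - beta2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# one step on a scalar parameter with unit gradient
theta, m, v = adam_step(np.array([1.0]), np.array([1.0]),
                        np.zeros(1), np.zeros(1), t=1)
```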

**Table 3.** Optimal number of hidden units in the LSTM architecture at different time supports [14].


#### **4. Results**

Directly from the datasets, the real operational relative-hardness ORH*<sup>R</sup>* is calculated from Equation (1), varying *λ* in the set {0.5, 0.6, ..., 1.4, 1.5} at each time *t* and for each time support. On the other hand, the LSTM predicts a probability vector over the soft, undefined and hard ORH states. By taking the highest probability, the predicted ORH*<sup>P</sup>* is obtained. Then, a confusion matrix, filled with the number of instances of each pair (ORH*<sup>R</sup>*, ORH*<sup>P</sup>*), is built for each time support and each *λ* value. Table 4 presents the cases *λ*: 0.5, 1.0 and 1.5, and supports 0.5, 2 and 8 h for SAG mill 1, while Table 5 summarizes the same results for SAG mill 2.
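The construction of the confusion matrix from the predicted probability vectors can be sketched as follows (a toy illustration with hypothetical names and data, not the paper's code):

```python
import numpy as np

CLASSES = ["soft", "undefined", "hard"]

def predict_class(prob_vector):
    """Take the highest-probability state as the predicted ORH."""
    return CLASSES[int(np.argmax(prob_vector))]

def confusion_matrix(real, pred):
    """Count instances of each (real, predicted) pair; rows are the real class."""
    cm = np.zeros((3, 3), dtype=int)
    for r, p in zip(real, pred):
        cm[CLASSES.index(r), CLASSES.index(p)] += 1
    return cm

# toy probability vectors (soft, undefined, hard) and toy real labels
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
pred = [predict_class(p) for p in probs]
real = ["soft", "hard", "hard"]
cm = confusion_matrix(real, pred)
```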

**Table 4.** SAG mill 1. Confusion matrices (number of instances) of operational relative-hardness (ORH) predictions using *λ*: 0.5, 1.0 and 1.5 at 0.5, 2 and 8 h time supports.


**Table 5.** SAG mill 2. Confusion matrices (number of instances) of ORH predictions using *λ*: 0.5, 1.0 and 1.5 at 0.5, 2 and 8 h time supports.


The accuracy of the model prediction, ORH*<sup>P</sup>*, defined as the percentage of correct predictions, is computed as:

$$ORH_{Accuracy} = \frac{\#\left(\mathbf{soft}_{R}, \mathbf{soft}_{P}\right) + \#\left(\mathbf{und}_{R}, \mathbf{und}_{P}\right) + \#\left(\mathbf{hard}_{R}, \mathbf{hard}_{P}\right)}{\#Total} \cdot 100\tag{3}$$

and it represents the percentage of elements in the confusion matrix diagonal. The relative percentage of predictions of each class (rows) is shown in Table 6 for SAG mill 1 and in Table 7 for SAG mill 2.
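Equation (3) amounts to the trace of the confusion matrix divided by the total number of instances; a minimal sketch with hypothetical counts (not the paper's data):

```python
import numpy as np

def orh_accuracy(cm):
    """Equation (3): percentage of instances on the confusion-matrix diagonal."""
    return cm.trace() / cm.sum() * 100.0

# illustrative counts (not the paper's); rows/columns: soft, undefined, hard
cm = np.array([[ 50,  10,   1],
               [  8, 400,  12],
               [  2,   9,  60]])
acc = orh_accuracy(cm)   # 510 correct predictions out of 552
```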

As shown in Tables 6 and 7, at the 0.5 h time support the LSTM is able to predict the ORH with sufficient confidence regardless of the value of *λ*. Nevertheless, as *λ* increases, the number of instances of soft and hard ORH decreases, improving the final accuracy, since the higher the value of *λ*, the more data points are classified as undefined. In particular, for the 0.5 h time support, increasing *λ* from 0.5 to 1.5 makes the number of real undefined points increase from 4325 to 6577 (from 53.0% to 80.7%) in SAG mill 1 and from 3600 to 6469 (from 45.3% to 81.4%) in SAG mill 2. Therefore, increasing *λ* improves accuracy, but at the price of resolution. On the other hand, the number of extreme cases (**soft***R*, **hard***P*) and (**hard***R*, **soft***P*) is close to zero. This is an important result, since predicting soft hardness when the material is actually hard (or vice versa) may induce bad short-term decisions on how to operate the SAG mill, along with other downstream decisions.

**Table 6.** SAG mill 1. Confusion matrices (percentage) of ORH prediction using *λ*: 0.5, 1.0 and 1.5 at 0.5, 2 and 8 h time supports.


**Table 7.** SAG mill 2. Confusion matrices (percentage) of ORH prediction using *λ*: 0.5, 1.0 and 1.5 at 0.5, 2 and 8 h time supports.


The percentage of extreme cases (**soft***R*, **hard***P*) and (**hard***R*, **soft***P*) using *λ*: 0.5 increases when moving from the 0.5 to the 8 h time support, on both SAG mills. However, it decreases to a value close to zero when *λ* is increased from 0.5 to 1.5, at all time supports. At the same time, the LSTM loses accuracy in predicting the relevant cases (**soft***R*, **soft***P*) and (**hard***R*, **hard***P*) as the time support increases, on both SAG mills.
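The extreme-case percentage corresponds to the two off-diagonal corners of each confusion matrix; a minimal sketch with hypothetical counts (not the paper's data):

```python
import numpy as np

def extreme_rate(cm):
    """Percentage of extreme misclassifications: real soft predicted hard
    plus real hard predicted soft (the two off-diagonal corners).
    Assumed row/column order: soft, undefined, hard."""
    return (cm[0, 2] + cm[2, 0]) / cm.sum() * 100.0

# illustrative counts (not the paper's)
cm = np.array([[ 50,  10,   1],
               [  8, 400,  12],
               [  2,   9,  60]])
rate = extreme_rate(cm)   # 3 extreme cases out of 552 instances
```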

The accuracy graph (Figure 3) shows the *λ* sensitivity at all time supports on both SAG mills. The lowest accuracy, 51%, is achieved at the 2 h time support with *λ*: 0.5 on SAG mill 1; it increases to 66% with *λ*: 1.0 and to 81% with *λ*: 1.5. The best results are achieved at the 0.5 h time support (the same support as the original data), where accuracies of 77%, 88% and 93% are obtained with *λ*: 0.5, 1.0 and 1.5, respectively, on SAG mill 1, and of 79%, 85% and 90% with *λ*: 0.5, 1.0 and 1.5 on SAG mill 2.

**Figure 3.** Accuracy of operational relative-hardness prediction at different time supports as a function of lambda (*λ*) on both SAG mills.

#### **5. Conclusions**

This work proposes the use of Long Short-Term Memory networks to forecast the operational relative-hardness in two SAG mills using operational data. We have presented the internal architecture of the deep networks, how to deal with raw operational datasets, and qualitative criteria to estimate the operational hardness of the material being processed inside the SAG mill based on the consumed energy, feed tonnage and a statistical distribution parameterized by a lambda value. In particular, Long Short-Term Memory models have been trained to predict the operational relative-hardness based only on low-cost and readily acquired operational information (feed tonnage, spindle speed and bearing pressure).

The LSTM network shows great results in predicting the operational relative-hardness at the 30 min time support. On SAG mill 1, using a lambda value of 0.5, the obtained accuracy was 77.3%, while increasing lambda to 1.5 raised the accuracy to 93.1%. Similar results were found on the second SAG mill. As the time support increases to two and eight hours, the accuracy drops to around 52% using a lambda value of 0.5 and 78% with a lambda value of 1.5, on both SAG mills.

The inaccuracy of the LSTM when predicting extreme cases, such as soft hardness when the material is actually hard and vice versa, is very low. Extreme misclassification is close to 1% at the 0.5 h time support on both SAG mills, regardless of the lambda value. Although it increases to around 20% at longer time supports using a lambda value of 0.5, it rapidly decreases to around 1% as lambda increases.

Lastly, the proposed application can be extended to any crushing and grinding equipment, under a similar real-data acquisition context, in order to forecast categorical attributes that are relevant to downstream processes.

**Author Contributions:** Conceptualization, S.A. and W.K.; methodology, S.A.; codes, S.A.; validation, S.A., W.K. and J.M.O.; formal analysis, S.A.; investigation, S.A.; resources, W.K.; data curation, S.A.; writing—original draft preparation, S.A.; visualization, S.A.; supervision, W.K. and J.M.O.; project administration, W.K.; funding acquisition, W.K. and J.M.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Natural Sciences and Engineering Council of Canada (NSERC) grant number RGPIN-2017-04200 and RGPAS-2017-507956, and by the Chilean National Commission for Scientific and Technological Research (CONICYT), through CONICYT/PIA Project AFB180004, and the CONICYT/FONDAP Project 15110019.

**Conflicts of Interest:** The authors declare no conflict of interest.

*Minerals* **2020**, *10*, 734
