Article

Study on Prediction of Zinc Grade by Transformer Model with De-Stationary Mechanism

1 School of Computer, Hunan University of Technology, Zhuzhou 412007, China
2 School of Automation, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Minerals 2024, 14(3), 230; https://doi.org/10.3390/min14030230
Submission received: 3 January 2024 / Revised: 5 February 2024 / Accepted: 20 February 2024 / Published: 25 February 2024
(This article belongs to the Section Mineral Processing and Extractive Metallurgy)

Abstract

At present, in the mineral flotation process, flotation data are easily influenced by various factors, resulting in non-stationary time series that lead to overfitting of prediction models and ultimately severely affect the accuracy of grade prediction. Thus, this study proposes a de-stationary attention mechanism based on the transformer model (DST) to learn the non-stationary information in raw mineral data sequences. First, normalization is applied to the matched flotation data and mineral grade values to make the data sequences stationary, thereby enhancing the model’s predictive capability. Then, the proposed de-stationary attention mechanism is employed in the modified vanilla transformer model to learn the temporal dependencies of the mineral flotation data, i.e., the non-stationary information in the mineral data sequences. Lastly, de-normalization is performed to keep the mineral prediction results on the same scale as the original data. Compared with existing models such as RNN, LSTM, transformer, Enc-Dec (RNN), and STS-D, the DST model reduced the RMSE by 20.8%, 20.8%, 62.8%, 20.5%, and 49.1%, respectively.

1. Introduction

Mineral grade is a crucial performance indicator for froth flotation [1,2], and accurate real-time prediction of mineral grade assists flotation operators in better recognizing the current flotation conditions, allowing for subsequent measures for extracting crucial minerals. Consequently, investigating grade monitoring during the froth flotation process is imperative.
In traditional flotation devices, operators often estimate grades by observing froth visual characteristics that are closely related to mineral grades. However, this monitoring method is highly subjective, and its accuracy is easily influenced by the operator’s level of experience. To enhance the precision and stability of mineral grade monitoring, X-ray fluorescence (XRF) analyzers have been developed and put into use in modern flotation plants. However, XRF analyzers are expensive and difficult to maintain. To reduce costs, flotation plants often use one XRF analyzer to monitor multiple mineral grades simultaneously, which increases measurement intervals [3], limiting their application in fully automated flotation control.
In recent years, with the advancement of digital image processing technologies, an increasing number of visual features have been extracted for the purpose of constructing mineral grade monitoring methods [4,5]. The features of a froth image can be divided into static features, dynamic features, and statistical features. The static features include bubble size and froth color, which are the most obvious features in froth flotation conditions; the dynamic features include froth stability and load bearing rate, reflecting the overall change trend in the froth; and the statistical features mainly refer to the texture of bubbles, which reflects the fineness of mineral granules attached to the surface of the froth. For instance, Popli et al. [6] employed a support vector machine (SVM) to investigate the correlation between multiple visual features extracted from froth images and mineral grades, aiming to predict grade outcomes, whereas Bendaouia et al. [1] predicted the grade of a mineral concentrate by extracting features such as bubble size, color, and texture.
In essence, froth image data are stored in a time-sequential manner, rich with temporal information and representing the dynamic change trends in the data. To mine useful information from these time series, various deep learning networks such as recurrent neural networks (RNN), long short-term memory (LSTM), and their variants [7,8,9], as well as other time-series networks, have been applied to the performance monitoring of flotation processes. Zhang et al. [10] introduced a froth flotation grade monitoring method based on LSTM using froth video sequences. Based on RNN, Yuan et al. [3] achieved grade monitoring by processing the nonlinear relationships between relevant flotation process data and mineral grades. Zhang et al. [11] proposed using an encoder–decoder structure combined with RNN models to predict the grade of zinc tailings. However, dependency relationships over longer time spans are difficult to capture using RNN and LSTM models. A grade prediction method based on the transformer model was proposed by Peng et al. [12]. The vanilla transformer used for time series prediction benefits from the long-term dependency capturing capability of the attention mechanism, allowing temporal dependencies and nonlinear relationships in time series data to be captured effectively [13]. It occupies a dominant position in research on neural-network-based long-term forecasting [14,15].
However, the vanilla transformer model applied for grade monitoring assumes that froth data are stationary, neglecting the non-stationarity factors in mineral froth data. Non-stationary data refer to data in a time series that exhibit instability or unpredictability. The defining characteristic of non-stationary data is that their statistical properties, such as mean, variance, etc., change over time, indicating that the distribution over time is not constant [16]. In the zinc ore flotation process, the characteristics of the feed material are affected by time, resulting in non-stationary flotation data [17,18]. It is difficult to describe these data using a single model or rule. Recently, some studies have shown that the transformer model is relatively weak in handling non-stationary time series, which can easily lead to severe overfitting phenomena [19].
To mitigate the decline in predictive performance caused by non-stationary data, classical statistical methods such as ARMA and ARIMA [20] can achieve stationarity of a time series through differencing operations. However, these traditional models cannot compete with deep-learning-based models in capturing dynamic changes in data and in solving long-sequence time series forecasting problems [21]. With deep learning models, some researchers have attempted to alleviate the non-stationarity of the original time series by normalizing the global data distribution into a specific distribution with mean 0 and variance 1, commonly referred to as Z-score standardization [15,22]. The core purpose of this normalization is to transform non-stationary data into stationary data in order to improve prediction accuracy. However, this approach is not entirely sound, because it weakens the ability of the prediction model to capture the temporal dependencies of the time series data, which can easily lead to overfitting of the prediction results [23].
To address the non-stationarity issue in deep learning models caused by differences in the statistical properties of input windows, Shen et al. proposed a two-stage transformer framework that exhibited excellent stability [21]. Liu et al. effectively mitigated the overfitting caused by non-stationary data in tasks such as weather and electricity-consumption forecasting by optimizing the attention mechanism in the transformer model [24]. Inspired by this, this paper proposes a de-stationary attention mechanism based on the transformer model that, after normalizing the data, approximates the non-stationary information in the original flotation data sequence by learning from the mean and variance of the original sequence rather than from the normalized statistics. The main contributions of this paper are as follows:
(1) To enhance the predictive capability of the model, normalization is performed on the input data, and a subsequent de-normalization is applied to the output results, ensuring that the predicted outcomes remain consistent with the original data’s scale range.
(2) To prevent the predictive model from overfitting, a de-stationary attention mechanism is constructed to account for the non-stationary information within the original data, replacing the attention mechanism in the vanilla transformer.
(3) Considering the difficulty of learning non-stationary information in the original mineral data, a multi-layer perceptron is used to approximate non-stationary information from the raw data and its statistical quantities.
(4) To solve the problem of the mismatch between froth videos and XRF detection time points for zinc ore grade values, we automatically match the froth image closest to the XRF detection time point, ensuring the consistency of the experimental data.
The remaining sections of this paper are organized as follows: Section 2 presents an example of zinc tailings grade prediction and the input structure for matching mineral grades. The details of the normalization, the de-stationary attention mechanism, and the de-normalization are described in Section 3. The experimental results and discussion are provided in Section 4, and the conclusions are presented in Section 5.

2. Relevant Theories

2.1. Example of Zinc Mine Tailings Grade Prediction

Figure 1 shows a lead–zinc flotation process in a Chinese plant. Because lead and zinc co-occur in the same ore, the process consists of three stages: lead flotation, zinc flotation, and lead–zinc mixed-concentrate flotation. The lead flotation tailings serve as the feed for zinc flotation. The zinc flotation circuit includes three-stage roughers, three-stage scavengers, and three-stage cleaners. The lead flotation tailings enter the first rougher, where flotation reagents are added. The froth then flows into the three-stage cleaners for further flotation, producing zinc concentrate. The slurry is redirected for re-flotation, resulting in a lead–zinc mixed concentrate.
To monitor mineral grade values in real time during flotation, an XRF analyzer (Thermo Fisher Scientific, Waltham, MA, USA; at the green triangle location in Figure 1) and three cameras (at the first rougher, Scavenger III, and Cleaner III cells) were installed. Due to the high cost of the XRF analyzer, only one unit was used to monitor all critical minerals, adopting a sampling multiplexing method. The dataset used here came from video footage captured in the Cleaner III flotation cell.

2.2. Matching of Froth Video and Mineral Grade Values

The camera took a froth video every 1–2 min, lasting about 3 min each. Due to the high cost of the XRF analyzer, monitoring of zinc ore grade values could only be carried out using a sampling multiplexing method. The sampling time interval of the XRF analyzer was not completely fixed, with an approximate sampling interval of 17–18 min, which is much longer than the sampling interval for froth videos. As shown in Figure 2, the timestamp of the froth video feature vector did not match the timestamp of the marked zinc ore grade value, requiring timestamp matching.
Due to the minimal change in the froth state over a short period, this study employed the first frame of a froth video as a representative of the entire froth video status. By utilizing techniques such as the optimized watershed algorithm, multi-color space fusion, and neighboring gray-level dependence matrix [25], we extracted 18 visual features of froth, including the average size of bubbles, mean red values, and energy, to characterize the froth image. Thus, the froth video could be represented using a set of froth visual features:
X = \{ F_1, F_2, \ldots, F_f \}
where f is the total number of froth visual features, f = 18. The variables F_1–F_4 represent the average bubble size and the standard deviation, steepness, and skewness of the bubble sizes; F_5–F_9 denote the contrast, correlation, entropy, uniformity, and energy; F_{10}–F_{12} indicate the mean values of red, green, and blue; F_{13}–F_{14} represent the mean hue and grayscale values; F_{15}–F_{17} stand for the relative mean values of red, green, and blue; and F_{18} denotes the carrying rate. The definitions and calculation methods of these features can be found in Appendix A; Huang and Zhou [25] also detail their explicit meanings and computational formulas.
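For illustration, the following minimal sketch assembles one froth image’s 18 visual features into the fixed-order vector described above. It is our own illustration, not code from the study; the feature names and the to_feature_vector helper are labels we introduce here.

```python
import numpy as np

# Illustrative ordering of F1..F18 as described in the text.
FEATURE_NAMES = [
    "bubble_size_mean", "bubble_size_std", "bubble_steepness", "bubble_skewness",  # F1-F4
    "contrast", "correlation", "entropy", "uniformity", "energy",                  # F5-F9
    "red_mean", "green_mean", "blue_mean",                                         # F10-F12
    "hue_mean", "gray_mean",                                                       # F13-F14
    "red_rel_mean", "green_rel_mean", "blue_rel_mean",                             # F15-F17
    "carrying_rate",                                                               # F18
]

def to_feature_vector(features: dict) -> np.ndarray:
    """Stack the 18 named features into a fixed-order vector (f = 18)."""
    return np.array([features[name] for name in FEATURE_NAMES], dtype=np.float64)
```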
This paper denotes the total numbers of froth videos and XRF-annotated grade values as N_v and N_l, respectively. Let \Delta T_v and \Delta T_l denote the sampling time intervals of the froth videos and the XRF analyzer. The sampling rate ratio between froth videos and annotated grade values is then represented as N:
N = \Delta T_l / \Delta T_v
Since each marked grade value corresponds to N froth videos, the relationship between the number of input froth videos (froth visual feature vectors) and the number of XRF-marked labels can be expressed as
N_v = N_l \times N
Considering that the initial froth videos X_i (i = 1, 2, \ldots, N_v) cannot be placed in one-to-one correspondence with the zinc ore grade values Y_j (j = 1, 2, \ldots, N_l) marked by the XRF analyzer, we opted to match each marked zinc ore grade value with the froth video having the shortest time interval relative to it. It should be emphasized that the timestamp of the matched flotation video should precede the time point at which the zinc ore grade value was marked by the XRF analyzer [11]:
X_m = \arg\min_{X_i} \, d\!\left(t|_{X_i}, \, t|_{Y_j}\right), \quad \text{s.t.} \; t|_{X_i} \le t|_{Y_j}
where X_m (m = j = 1, 2, \ldots, N_l) is the froth video matched to the marked grade value Y_j, i.e., the video closest to it in time. The function d(\cdot) represents the Euclidean distance between timestamps, t|_{X_i} denotes the timestamp of the froth video X_i, and t|_{Y_j} denotes the timestamp at which the XRF analyzer marked the grade value Y_j. The effective input for the matched froth video sequence is then
X = [X_1, X_2, \ldots, X_{N_l}] \in \mathbb{R}^{N_l \times f}
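As a concrete illustration of the matching rule above, the sketch below (our own, not the authors’ released code) picks, for each XRF timestamp, the froth video with the nearest earlier timestamp. It assumes video_times and label_times are sorted arrays of timestamps in a common unit.

```python
import numpy as np

def match_videos_to_labels(video_times: np.ndarray, label_times: np.ndarray) -> np.ndarray:
    """For each marked grade value Y_j, return the index of the froth video X_i
    whose timestamp is closest to, but not later than, the XRF timestamp."""
    matched = []
    for t_label in label_times:
        candidates = np.where(video_times <= t_label)[0]  # enforce t|X_i <= t|Y_j
        matched.append(candidates[-1])                    # nearest earlier timestamp (times are sorted)
    return np.asarray(matched)

# Usage sketch: rows of a (N_v x 18) feature matrix indexed by the matches give
# the effective (N_l x 18) input matrix X.
# X_matched = video_features[match_videos_to_labels(video_times, label_times)]
```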

3. Methodology

The overall flow chart of this paper is presented in Figure 3. First, we normalized the matched flotation data videos, to enhance the predictive capability of the model. Meanwhile, we stored the mean and variance of the original data, for subsequent learning of relevant non-stationary information within these values. Subsequently, we selected the vanilla transformer model to process the normalized froth video sequences. Specifically, in the encoding layer, we replaced the attention mechanism of the vanilla transformer model with a de-stationary attention mechanism module that considered the original non-stationary information. Approximating non-stationary information from the original data and statistical quantities through multi-layer perceptrons helped to avoid over-fitting in the predictive model. Lastly, to maintain consistency between the predicted mineral grades and the original data scale range, we de-normalized the predicted values to obtain the final prediction sequence. The combination of normalization and de-normalization enhanced the model’s predictive capability, while the de-stationary attention mechanism could capture the time dependence of the data in the flotation process, reducing prediction errors caused by over-stationarity.

3.1. Normalization

To eliminate the scale differences between sequences and enhance the stability of the distribution of visual features in the sequential froth videos from different input points, a sliding-window approach was employed to normalize each dimension of the time series data X_m [26]. This ensured that the data within adjacent windows possessed identical mean and variance. Following the normalization process, the resulting normalized input sequence is denoted as X^* = [X_1^*, X_2^*, \ldots, X_{N_l}^*].
The mean \mu_X and variance \sigma_X^2 of the original matched input data sequence X and the normalized data X_m^* are calculated as shown in Equations (6)–(8):
\mu_X = \frac{1}{N_l} \sum_{m=1}^{N_l} X_m
\sigma_X^2 = \frac{1}{N_l} \sum_{m=1}^{N_l} \left(X_m - \mu_X\right)^2
X_m^* = \frac{1}{\sigma_X} \odot \left(X_m - \mu_X\right)
where \odot represents the Hadamard product. Such standardization reduces the distributional differences between the flotation video time series inputs, thereby making the input distribution of the model more stable.
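A minimal sketch of this per-window normalization (Equations (6)–(8)) follows; the epsilon guard is our addition for numerical safety and is not part of the paper’s formulation.

```python
import numpy as np

def normalize_window(X: np.ndarray, eps: float = 1e-8):
    """Normalize a matched window X of shape (N_l, f) feature-wise, returning the
    normalized sequence together with the stored mean and standard deviation."""
    mu_X = X.mean(axis=0, keepdims=True)                                        # Eq. (6)
    sigma_X = np.sqrt(((X - mu_X) ** 2).mean(axis=0, keepdims=True)) + eps      # Eq. (7)
    X_star = (X - mu_X) / sigma_X                                               # Eq. (8), elementwise (Hadamard) division
    return X_star, mu_X, sigma_X
```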
However, on the other hand, over-stationary processing results in the loss of intrinsic non-stationary information in time series, rendering a model unable to capture the time-dependent nature of mineral grade values in the flotation process, and thus affecting the accuracy of grade prediction outcomes.
To demonstrate the non-stationarity of the flotation data, we plotted the bubble-size time series extracted from the first frames of the froth videos matched to the XRF measurements, comprising 800 data points with a time interval of 17–18 min between consecutive points, as shown in Figure 4. The red solid line and black dashed line in the figure represent the average bubble size and the variance over the period, respectively. From Figure 4, it can be observed that the data curve has no clear trend, and the mean and variance vary differently over time. Consequently, this study modifies the attention mechanism of the vanilla transformer model into a de-stationary attention mechanism, aiming to approximate the attention that would be learned on the original non-stationary sequences.

3.2. De-Stationary Attention Mechanism

In the vanilla transformer model, the attention mechanism is calculated as follows [13]:
\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
where Q = [q_1, q_2, \ldots, q_{N_l}], K = [k_1, k_2, \ldots, k_{N_l}], and V = [v_1, v_2, \ldots, v_{N_l}] represent the query, key, and value matrices in the attention mechanism, respectively.
Assuming that the model’s embedding layer and feed-forward layer maintain linear characteristics in the temporal dimension, the Q, K, and V values in the attention mechanism can be calculated through a linear function f [24]. For the normalized flotation video input sequence X^* = (X - \mathbf{1}\mu_X^{\top})/\sigma_X, where \mathbf{1} \in \mathbb{R}^{N_l \times 1} is a vector of all ones, Q^* in the attention mechanism is calculated as follows:
Q^* = \begin{bmatrix} q_1^* \\ q_2^* \\ \vdots \\ q_{N_l}^* \end{bmatrix}
= \begin{bmatrix} f(X_1^*) \\ f(X_2^*) \\ \vdots \\ f(X_{N_l}^*) \end{bmatrix}
= \begin{bmatrix} f\big((X_1 - \mu_X)/\sigma_X\big) \\ f\big((X_2 - \mu_X)/\sigma_X\big) \\ \vdots \\ f\big((X_{N_l} - \mu_X)/\sigma_X\big) \end{bmatrix}
= \frac{1}{\sigma_X}\begin{bmatrix} f(X_1) - f(\mu_X) \\ f(X_2) - f(\mu_X) \\ \vdots \\ f(X_{N_l}) - f(\mu_X) \end{bmatrix}
= \frac{Q - \mathbf{1}\mu_Q^{\top}}{\sigma_X}
It is noteworthy that all input feature variables underwent normalization, so each variable in the flotation video sequence has the same constant variance, denoted \sigma_{X_1} = \sigma_{X_2} = \cdots = \sigma_{X_{N_l}} = \sigma_X. Here, \mu_Q \in \mathbb{R}^{d_k \times 1} represents the average value of Q in the temporal dimension, and analogous conclusions hold for K^* and V^*. The following equation can then be derived:
Q^* K^{*\top} = \frac{1}{\sigma_X^2}\left( Q K^{\top} - \mathbf{1}\big(\mu_Q^{\top} K^{\top}\big) - \big(Q \mu_K\big)\mathbf{1}^{\top} + \mathbf{1}\big(\mu_Q^{\top} \mu_K\big)\mathbf{1}^{\top} \right)
Therefore, the current attention mechanism is calculated as follows:
\mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\sigma_X^2\, Q^* K^{*\top} + \mathbf{1}\big(\mu_Q^{\top} K^{\top}\big) + \big(Q \mu_K\big)\mathbf{1}^{\top} - \mathbf{1}\big(\mu_Q^{\top} \mu_K\big)\mathbf{1}^{\top}}{\sqrt{d_k}}\right)
According to the shift-invariance property of the Softmax operator [13], Equation (12) can be simplified to Equation (13).
\mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\sigma_X^2\, Q^* K^{*\top} + \mathbf{1}\big(\mu_Q^{\top} K^{\top}\big)}{\sqrt{d_k}}\right)
From the above formula, it can be inferred that the attention calculation now considers not only Q^* and K^*, obtained from the stationarized froth video sequence X^*, but also \sigma_X, \mu_Q, and K from the original froth sequence. This significantly mitigates the tendency of the vanilla transformer model to predict overly stable mineral grade values.
The key to recovering the non-stationary factors of the original data is to approximate \alpha = \sigma_X^2 and \beta = K\mu_Q. However, because fully connected feed-forward networks contain nonlinear activation layers, strict linearity does not hold between the layers of the vanilla transformer model. To weaken this linearity assumption, this paper employs multi-layer perceptrons (MLP) to directly learn adaptive de-stationary factors from the non-stationarity-related statistics of the froth video sequence X together with Q and K. By learning these de-stationary factors, the model can better capture the essential features of the data and thereby improve performance. However, only limited non-stationary information can be learned from Q and K, so the non-stationary factors must also be learned from the original froth video sequence X. The de-stationary attention is formulated as follows:
\log \alpha = \mathrm{MLP}(\sigma_X, X)
\beta = \mathrm{MLP}(\mu_X, X)
\mathrm{Attn}(Q^*, K^*, V^*, \alpha, \beta) = \mathrm{Softmax}\!\left(\frac{\alpha\, Q^* K^{*\top} + \mathbf{1}\beta^{\top}}{\sqrt{d_k}}\right) V^*
This de-stationary attention design facilitates learning the non-stationary factors from the original flotation video data X and its statistics \mu_X and \sigma_X.
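The sketch below is our own PyTorch rendering of the de-stationary attention above, not the authors’ released code: two small MLP projectors estimate log α and β from the stored statistics and a pooled summary of the raw sequence, and the scores of the normalized queries and keys are rescaled and shifted before the Softmax. The projector widths and the mean-pooling of X are assumptions.

```python
import torch
import torch.nn as nn

class DeStationaryAttention(nn.Module):
    """Sketch: scores of the normalized Q*, K* are rescaled by alpha and shifted
    by beta, both learned from the raw sequence and its stored statistics."""

    def __init__(self, seq_len: int, n_feat: int, d_k: int, hidden: int = 64):
        super().__init__()
        # log(alpha): one scalar per sequence; beta: one value per time step.
        self.alpha_mlp = nn.Sequential(
            nn.Linear(2 * n_feat, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.beta_mlp = nn.Sequential(
            nn.Linear(2 * n_feat, hidden), nn.ReLU(), nn.Linear(hidden, seq_len))
        self.d_k = d_k

    def forward(self, q, k, v, x_raw, mu_x, sigma_x):
        # q, k, v: (B, L, d_k) computed from the normalized sequence X*.
        # x_raw: (B, L, n_feat); mu_x, sigma_x: (B, 1, n_feat) stored statistics.
        pooled = x_raw.mean(dim=1)                                                    # (B, n_feat) summary of X
        log_alpha = self.alpha_mlp(torch.cat([sigma_x.squeeze(1), pooled], dim=-1))   # (B, 1)
        beta = self.beta_mlp(torch.cat([mu_x.squeeze(1), pooled], dim=-1))            # (B, L)
        scores = torch.exp(log_alpha).unsqueeze(-1) * (q @ k.transpose(-2, -1))       # alpha * Q* K*^T
        scores = scores + beta.unsqueeze(1)                                           # add 1 beta^T to every query row
        attn = torch.softmax(scores / self.d_k ** 0.5, dim=-1)
        return attn @ v
```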
After introducing the de-stationary attention mechanism into the encoding stage of the vanilla transformer model, we performed ore grade prediction. During the model training phase, the mineral flotation data were normalized to enhance predictive accuracy.

3.3. De-Normalization

The mineral grade prediction results obtained during the model testing phase are denoted as Y_g^* (g = 1, 2, \ldots, N_l), and the sequence of prediction results is written as Y^* = [Y_1^*, Y_2^*, \ldots, Y_{N_l}^*].
To keep the predicted results within the same scale and range as the original data, the final flotation outcome Y_g was obtained by applying inverse normalization [27] to the grade prediction result Y_g^*, based on \sigma_X and \mu_X:
Y_g = \sigma_X \odot Y_g^* + \mu_X
The final predicted sequence is denoted as Y = [Y_1, Y_2, \ldots, Y_{N_l}]. Through the conversion processes of normalization and de-normalization, the model achieved improved predictive outcomes, while becoming more aligned with the actual grade values.
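A minimal de-normalization sketch matching the inverse mapping above; it simply reuses the mean and standard deviation stored during normalization.

```python
import numpy as np

def de_normalize(Y_star: np.ndarray, mu_X: np.ndarray, sigma_X: np.ndarray) -> np.ndarray:
    """Map predicted grades back to the original scale: Y = sigma_X * Y* + mu_X."""
    return sigma_X * Y_star + mu_X
```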

4. Experiment and Result Analysis

4.1. Dataset

The experimental data presented in this study originated from a lead–zinc flotation plant in Guangdong, China. The dataset was collected between 9 September 2020 and 16 October 2020. The total number of samples amounted to 3004, with a resolution of 692 × 518 pixels for the froth images. The dataset used in this study can be accessed online at http://dx.doi.org/10.21227/q1q5-d663 (accessed on 24 March 2023).
During the experimental process, the dataset was randomly divided into three portions: the training set (60%) was used for model training; the validation set (20%) was used for parameter selection; and the testing set (20%) was used to assess the performance of the mineral grade prediction model. Performance on the testing set served as the benchmark for evaluating the predictive ability of the model.
This paper employed the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) as evaluation metrics for the model, with their respective calculation formulas presented below:
\mathrm{RMSE} = \sqrt{\frac{1}{N_l}\sum_{i=1}^{N_l} \left(Y_i - Y_i^*\right)^2}
\mathrm{MAE} = \frac{1}{N_l}\sum_{i=1}^{N_l} \left|Y_i - Y_i^*\right|
R^2 = 1 - \frac{\sum_{i=1}^{N_l}\left(Y_i - Y_i^*\right)^2}{\sum_{i=1}^{N_l}\left(Y_i - \bar{Y}\right)^2}
In this context, Y_i and Y_i^* represent the actual ore grade measured by the XRF analyzer and the predicted tailings grade value, respectively; \bar{Y} denotes the mean of the measured ore grade values, and N_l is the total number of samples.
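For reference, a small sketch computing the three metrics as defined above (using the mean of the measured grades in the R² denominator):

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Return (RMSE, MAE, R^2) for measured grades y_true and predictions y_pred."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mae, r2
```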

4.2. Hyperparameter Selection

The choice of hyperparameters has a significant impact on the predictive performance of a model. In the mineral grade prediction model proposed in this study, the batch size b_s, learning rate l_r, sampling time step T, and number of hidden units N were crucial hyperparameters that required careful selection.
To identify the optimal hyperparameters, a grid search strategy [28] was employed on the validation set. Grid search divides the search space into a grid of candidate configurations, quickly eliminates unpromising ones, and narrows down the search range; it also parallelizes well, making efficient use of computing resources. The batch size was chosen from {1, 2, 4, 8, 16}; the learning rate from {0.1, 0.01, 0.001, 0.0001}; the sampling time step from {1, 3, 5, 7, 10}; and the number of hidden units from {1, 2, 4, 8, 16}. Optimal prediction performance corresponds to the minimum loss, with the loss on the validation set serving as the criterion. The prediction model was trained on the training dataset and validated on the validation dataset. The validation process was randomly repeated 50 times, and the average validation loss is presented in Figure 5. The validation loss was minimal when b_s = 2, l_r = 0.001, T = 10, and N = 10.
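A minimal grid-search sketch over these four hyperparameters follows; train_and_validate is a hypothetical stand-in for training the model with a given configuration and returning the averaged validation loss.

```python
import itertools

def train_and_validate(config: dict) -> float:
    """Hypothetical placeholder: train the prediction model with `config` and
    return the mean validation loss over repeated runs."""
    raise NotImplementedError

search_space = {
    "batch_size": [1, 2, 4, 8, 16],
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
    "time_step": [1, 3, 5, 7, 10],
    "hidden_units": [1, 2, 4, 8, 16],
}

best_loss, best_config = float("inf"), None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    loss = train_and_validate(config)   # mean validation loss for this configuration
    if loss < best_loss:
        best_loss, best_config = loss, config
```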
To validate the effectiveness of the method proposed in this paper, we compared the proposed model, the DST model, with other state-of-the-art models, including the RNN model [7], the LSTM model [8], the transformer model [13], the Enc-Dec (RNN) model [11], and the STS-D model [9]. For a fair comparison, we selected optimal hyperparameters for the proposed model and these networks on the validation set through a grid search. The optimal hyperparameters are recorded in Table 1.
As shown in Table 1, all models except the STS-D model achieved their minimum validation loss at a batch size of 2, while the STS-D model did so at a batch size of 8. Likewise, the STS-D model reached its minimum validation loss at a learning rate of 0.0001, whereas the other models did so at 0.001. When the learning rate is too low, models tend to fall into local optima. The optimal sampling time step for most models ranged between 5 and 10, and the optimal number of hidden units was either 10 or 20.
Each comparative model was run 50 times on the randomly drawn test dataset, and the averaged results were taken as the experimental outcome. Table 2 compares the evaluation metrics of the various models, with the DST model exhibiting the best performance on all of them. Compared to the RNN, LSTM, transformer, Enc-Dec (RNN), and STS-D models, the DST model’s RMSE was reduced by 20.8%, 20.8%, 62.8%, 20.5%, and 49.1%, respectively. By considering the non-stationary information of the original mineral data, the DST model’s MAE was 10.8%, 10.8%, 54.3%, 11.3%, and 41.2% lower than those of the other models, respectively. Furthermore, the DST model achieved the highest R² value, indicating a better fit and predicted values closer to the actual mineral grade: its R² exceeded that of the other five models by 0.0116, 0.0116, 0.0669, 0.0075, and 0.0305, respectively.
The above evidence indicates that the mineral grade prediction model proposed in this study, which considered non-stationary information in the original mineral data, exhibited better accuracy compared to other models that did not consider this information. This is primarily because the proposed grade prediction model not only enhances the predictability of the mineral grade sequence by utilizing a combination of normalization and de-normalization modules but also takes into account the non-stationary information present in the original mineral data, thereby reducing the occurrence of overfitting due to excessive smoothing.
To further demonstrate the effectiveness of the proposed model, the predicted values of all models were compared with the true values on the test data. As illustrated in Figure 6, the blue line represents the grade values measured by the XRF analyzer, while the red line denotes the predicted values from each model. Figure 6 shows intuitively that the fit of the proposed method was superior to that of the comparison methods. Specifically, compared with the RNN, LSTM, and Enc-Dec (RNN) models, our method achieved higher fitting accuracy and stability; compared with the STS-D model, it showed better dynamic adaptability in predicting non-stationary time series; and compared with the transformer model, it exhibited stronger robustness in handling non-stationary data.

5. Conclusions

This study enhanced the predictive capability of the model by normalizing the matched flotation video data and introducing a de-stationary attention mechanism module that considers the original non-stationary information, replacing the attention mechanism of the vanilla transformer model. Furthermore, the study stored the mean and variance of the original data to learn the relevant non-stationary information and approximated this information through multi-layer perceptrons to prevent overfitting of the predictive model. After obtaining the predicted mineral grade values, the final predicted sequence was obtained through inverse normalization. In the experiments on the real dataset, the proposed mineral grade prediction method demonstrated the best fitting effect. This confirms the effectiveness of the combination of normalization and de-normalization, as well as of the de-stationary attention mechanism, which enables the model to capture the time dependency of the data in the flotation process and thereby reduces prediction errors caused by over-stationarization. In addition, since the dataset used here only covers froth image features of zinc ore over a specific period, future research will also consider the characteristics of the feed in the flotation process, to make the proposed method more universal and widely applicable.

Author Contributions

Conceptualization, C.P. and L.L.; methodology, C.P.; software, H.L.; validation, H.L., C.P. and L.L.; formal analysis, C.P. and L.L.; investigation, C.P. and H.L.; resources, L.L. and Z.T.; data curation, L.L. and Z.T.; writing—original draft preparation, C.P. and L.L.; writing—review and editing, C.P.; visualization, H.L. and L.L.; supervision, C.P. and Z.T.; project administration, C.P.; funding acquisition, Z.T. and C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province (No. 2023JJ50200) and the Key Project of the Hunan Provincial Education Department (No. 22A0390).

Data Availability Statement

The datasets used in this study are available at http://dx.doi.org/10.21227/q1q5-d663 (accessed on 24 March 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Definitions and Calculation Methods of Froth Image Features

The average bubble size f 1 : This refers to the ratio of the area of froth regions to the number of froth regions within a certain range. This parameter can be used to evaluate the stability and effectiveness of the flotation process, providing a basis for monitoring and controlling the flotation production conditions.
Calculation method: Image segmentation techniques (such as the watershed algorithm or threshold segmentation) are used to segment the bubbles in the flotation froth image, yielding the edge of each bubble and hence the number of bubbles in the image. The average bubble size is then obtained by dividing the total number of pixels in the bubble regions by the number of bubbles.
The standard deviation of bubbles f 2 : This refers to the degree of variation in the size of bubbles in flotation froth. It is used to describe the dispersion of bubble sizes in flotation froth, which can reflect the uniformity or heterogeneity of the froth. A smaller standard deviation indicates a more uniform bubble size, while a larger standard deviation indicates a greater variability in bubble sizes.
Calculation method: By using image segmentation techniques to segment the bubbles in flotation froth images, we can calculate the size of each bubble. Next, we calculate the standard deviation of the bubble sizes with respect to the average bubble size in the entire froth image, thereby obtaining a characteristic of the froth image, which is the standard deviation of bubble sizes.
The steepness of bubbles f 3 : This refers to the rate of change in bubble size in froth images. The greater the steepness, the more dramatic the change in bubble size, which may indicate larger size differences. The smaller the steepness, the more gradual the change in bubble size, which may indicate a more consistent size distribution.
Calculation method: The curvature is obtained by calculating the second-order derivative of the surface curve composed of each bubble size. Then, further processing of the curvature can obtain the steepness of the bubble.
The skewness of bubbles f 4 : A statistic used to measure the degree of skewness of the bubble size distribution, which can help understand the bias of the bubble size distribution. The closer the calculation result is to zero, the closer the data distribution is to symmetry; the closer it is to positive numbers, the more right-skewed the data distribution; and the closer it is to negative numbers, the more left-skewed the data distribution.
Calculation method: Compute the cubed differences between each bubble size and the average bubble size, sum the results, and divide by the cube of the standard deviation of the bubble sizes.
Contrast f 5 : This describes the degree of brightness difference between adjacent bubbles or within a bubble image. A higher contrast indicates a more pronounced difference in brightness between bubbles.
Calculation method: This is obtained by calculating the standard deviation of the gray-level co-occurrence matrix (GLCM). First, calculate the GLCM, then normalize the GLCM by dividing each element value by the sum of all element values in the GLCM, resulting in a normalized frequency matrix. Based on this, the contrast of the bubble image can be calculated.
Correlation f 6 : This describes the consistency of brightness changes between adjacent bubbles in an image. The higher the correlation, i.e., the closer the correlation value is to 1, the more consistent the brightness change trends between the bubbles.
Calculation method: The standard deviation of the grayscale co-occurrence matrix (GLCM) is used to obtain this parameter. First, calculate the GLCM, then normalize the GLCM by dividing the element values of the GLCM by the sum of all element values in the GLCM, obtaining a normalized frequency matrix. Based on this, calculate the correlation of the bubble image.
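As an example of how such GLCM texture features can be computed in practice, the sketch below uses scikit-image’s standard GLCM property formulas (which may differ in detail from the exact statistics used in this study); the distance and angle settings are assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_texture_features(gray_image: np.ndarray) -> dict:
    """Compute GLCM-based contrast and correlation for a 2-D uint8 froth image."""
    glcm = graycomatrix(gray_image, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return {
        "contrast": graycoprops(glcm, "contrast")[0, 0],
        "correlation": graycoprops(glcm, "correlation")[0, 0],
    }
```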
Entropy f 7 : This refers to the uniformity or randomness of the distribution of grayscales in bubble images. The larger the entropy value, the more uniform the distribution of grayscales in the bubble image, and the more information the image contains; the smaller the entropy value, the less uniform the distribution of grayscales in the bubble image, and the less information the image contains.
Calculation method: First, convert the bubble image into a grayscale image; then, count the occurrence frequency of each grayscale level; finally, calculate the entropy value based on the frequency.
Uniformity f 8 : This refers to the uniformity of the pixel grayscale distribution in a bubble image.
Calculation method: First, calculate the average grayscale value of the entire image; then, calculate the difference between the grayscale values of each bubble region and the average grayscale value; finally, by analyzing the magnitude and frequency of these differences, we can obtain the uniformity characteristic.
Energy f 9 : This refers to the concentration of the pixel grayscale distribution in an image. A lower energy value indicates a more uniform distribution of objects or features in the image, while a higher energy value indicates a more concentrated distribution.
Calculation method: Calculate the gradient magnitude or grayscale value of each pixel and accumulate them to obtain the energy value of the entire image.
The mean of red f 10 : This refers to the average pixel value of the red channel in a bubble image.
Calculation method: Separate the red, green, and blue channels; iterate through the red channel; and calculate the sum of pixel values in the red channel. Finally, divide the sum of all pixel values in the red channel by the total number of pixels in the red channel to obtain the mean of the red channel.
The mean of green f 11 : This refers to the average pixel value of the green channel in a bubble image.
Calculation method: Separate the red, green, and blue channels; iterate through the green channel; and calculate the sum of pixel values in the green channel. Finally, divide the sum of all pixel values in the green channel by the total number of pixels in the green channel, obtaining the mean of the green channel.
The mean of blue f 12 : This refers to the average pixel value of the blue channel in a bubble image.
Calculation method: Separate the red, green, and blue channels; iterate through the blue channel; and calculate the sum of pixel values in the blue channel. Finally, divide the sum of all pixel values in the blue channel by the total number of pixels in the blue channel, obtaining the mean of the blue channel.
The mean of hue f 13 : This refers to the overall color tone of the froth bubbles in a bubble image.
Calculation method: First, convert the image to the HSV (hue, saturation, value) color space. Then, iterate through all the pixels in the image and for each pixel, obtain the H value. Finally, divide the sum of all the hue values by the total number of pixels in the image, obtaining the mean hue.
The mean of grayscale f 14 : This refers to the average value of all pixel intensities in the image. This can be used to describe the brightness characteristic of the image, reflecting the overall grayscale level of the froth bubble image.
Calculation method: First, convert the image into a grayscale image. Then, iterate through all the pixels in the grayscale image and calculate the sum of all pixel intensities. Finally, divide the sum of intensities by the total number of pixels, obtaining the mean grayscale value.
The relative mean of red f 15 : This refers to the average value of the red channel in the image relative to the overall range of pixel values in the image. This can be used to describe the relative distribution of red color in the image, rather than just the absolute average.
Calculation method: First, obtain the red channel values for each pixel. Then, calculate the average of all red channel values and the range of pixel values in the image (maximum value minus the minimum value). Finally, divide the average red channel value by the pixel value range.
The relative mean of green f 16 : This refers to the average value of the green channel in the image relative to the overall range of pixel values in the image. It can be used to describe the relative distribution of green color in the image.
Calculation method: First, obtain the green channel values for each pixel. Then, calculate the average of all green channel values and the range of pixel values in the image (maximum value minus the minimum value). Finally, divide the average green channel value by the pixel value range.
The relative mean of blue f 17 : This refers to the average value of the blue channel in the image relative to the overall range of pixel values in the image. This can be used to describe the relative distribution of blue color in the image.
Calculation method: First, obtain the blue channel values for each pixel. Then, calculate the average of all blue channel values and the range of pixel values in the image (maximum value minus the minimum value). Finally, divide the average blue channel value by the pixel value range.
The carrying rate f 18 : This refers to the ratio of the actual area carrying minerals in a froth image to the total area. This can reflect the concentration and distribution of minerals during the flotation process. A higher carrying rate indicates a larger proportion of minerals distributed in the froth, indicating a better flotation efficiency.
Calculation method: Since bubbles with a small amount of attached minerals always have a total reflection point, the carrying rate can be calculated by dividing the sum of the areas of bubbles without total reflection points by the total number of pixels in the entire image.

References

  1. Bendaouia, A.; Qassimi, S.; Boussetta, A.; Benzakour, I.; Benhayoun, A.; Amar, O.; Bourzeix, F.; Baïna, K.; Cherkaoui, M.; Hasidi, O.; et al. Hybrid features extraction for the online mineral grades determination in the flotation froth using Deep Learning. Eng. Appl. Artif. Intell. 2024, 129, 107680. [Google Scholar] [CrossRef]
  2. Nkadimeng, M.; Manono, M.S.; Corin, K.C. Developing a Relationship between Ore Feed Grade and Flotation Performance. Eng. Proc. 2023, 37, 101. [Google Scholar]
  3. Yuan, X.; Gu, Y.; Wang, Y.; Yang, C.; Gui, W. A deep supervised learning framework for data-driven soft sensor modeling of industrial processes. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 4737–4746. [Google Scholar] [CrossRef] [PubMed]
  4. Bhondayi, C. Flotation froth phase bubble size measurement. Miner. Process. Extr. Metall. Rev. 2022, 43, 251–273. [Google Scholar] [CrossRef]
  5. Xie, Y.; Wu, J.; Xu, D.; Yang, C.; Gui, W. Reagent addition control for stibium rougher flotation based on sensitive froth image features. IEEE Trans. Ind. Electron. 2016, 64, 4199–4206. [Google Scholar] [CrossRef]
  6. Popli, K.; Afacan, A.; Liu, Q.; Prasad, V. Development of online soft sensors and dynamic fundamental model-based process monitoring for complex sulfide ore flotation. Miner. Eng. 2018, 124, 10–27. [Google Scholar] [CrossRef]
  7. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  8. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar]
  9. Zhang, H.; Tang, Z.; Xie, Y.; Yuan, H.; Chen, Q.; Gui, W. Siamese time series and difference networks for performance monitoring in the froth flotation process. IEEE Trans. Ind. Inform. 2021, 18, 2539–2549. [Google Scholar] [CrossRef]
  10. Zhang, H.; Tang, Z.; Xie, Y.; Gao, X.; Chen, Q.; Gui, W. Long short-term memory-based grade monitoring in froth flotation using a froth video sequence. Miner. Eng. 2021, 160, 106677. [Google Scholar] [CrossRef]
  11. Zhang, H.; Tang, Z.; Xie, Y.; Luo, J.; Chen, Q.; Gui, W. Grade prediction of zinc tailings using an encoder-decoder model in froth flotation. Miner. Eng. 2021, 172, 107173. [Google Scholar] [CrossRef]
  12. Peng, C.; Liu, Y.; Ouyang, Y.; Tang, Z.; Luo, L.; Gui, W. Grade Prediction of Froth Flotation Based on Multistep Fusion Transformer Model. IEEE Trans. Ind. Inform. 2023. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  14. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  15. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. Etsformer: Exponential smoothing transformers for time-series forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar]
  16. Priestley, M.; Rao, T.S. A test for non-stationarity of time-series. J. R. Stat. Soc. Ser. B Stat. Methodol. 1969, 31, 140–149. [Google Scholar] [CrossRef]
  17. Suazo, C.; Kracht, W.; Alruiz, O. Geometallurgical modelling of the Collahuasi flotation circuit. Miner. Eng. 2010, 23, 137–142. [Google Scholar] [CrossRef]
  18. Alruiz, O.; Morrell, S.; Suazo, C.; Naranjo, A. A novel approach to the geometallurgical modelling of the Collahuasi grinding circuit. Miner. Eng. 2009, 22, 1060–1067. [Google Scholar] [CrossRef]
  19. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtually, 3–7 May 2021. [Google Scholar]
  20. Zhang, M. Time Series: Autoregressive Models AR, MA, ARMA, ARIMA; University of Pittsburgh: Pittsburgh, PA, USA, 2018. [Google Scholar]
  21. Shen, L.; Wei, Y.; Wang, Y. GBT: Two-stage transformer framework for non-stationary time series forecasting. Neural Netw. 2023, 165, 953–970. [Google Scholar] [CrossRef] [PubMed]
  22. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  23. Salles, R.; Belloze, K.; Porto, F.; Gonzalez, P.H.; Ogasawara, E. Nonstationary time series transformation methods: An experimental review. Knowl.-Based Syst. 2019, 164, 274–291. [Google Scholar] [CrossRef]
  24. Liu, Y.; Wu, H.; Wang, J.; Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting. Adv. Neural Inf. Process. Syst. 2022, 35, 9881–9893. [Google Scholar]
  25. Huang, M.; Zhou, X. Feature selection in froth flotation for production condition recognition. IFAC-PapersOnLine 2018, 51, 123–128. [Google Scholar]
  26. Tanaka, T.; Nambu, I.; Maruyama, Y.; Wada, Y. Sliding-Window Normalization to Improve the Performance of Machine-Learning Models for Real-Time Motion Prediction Using Electromyography. Sensors 2022, 22, 5005. [Google Scholar] [CrossRef]
  27. Hiransha, M.; Gopalakrishnan, E.A.; Menon, V.K.; Soman, K. NSE stock market prediction using deep-learning models. Procedia Comput. Sci. 2018, 132, 1351–1362. [Google Scholar]
  28. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059. [Google Scholar]
Figure 1. Process of zinc flotation circuit.
Figure 2. Matching flotation video with time points of XRF analyzer.
Figure 3. Overall flow chart.
Figure 4. Non-stationary time series of bubble size.
Figure 5. Validation loss under different parameters: (a) batch size; (b) learning rate; (c) time step; (d) hidden units.
Figure 6. Comparison of testing results on the test set for the six models: (a) DST model; (b) RNN model; (c) LSTM model; (d) Transformer model; (e) Enc-Dec (RNN) model; (f) STS-D model.
Table 1. Optimal hyperparameters for all models.

Method          b_s    l_r      T    N
DST             2      0.001    10   10
RNN             2      0.001    7    20
LSTM            2      0.001    7    20
Transformer     2      0.001    1    10
Enc-Dec (RNN)   2      0.001    5    10
STS-D           8      0.0001   10   10
Table 2. Prediction results for the different models. Data format: mean ± (standard deviation).

Method          RMSE                 MAE                  R²
DST             0.1582 ± (0.0043)    0.3955 ± (0.0025)    0.8978 ± (0.0203)
RNN             0.1998 ± (0.0001)    0.4436 ± (0.0002)    0.8862 ± (0.0001)
LSTM            0.1998 ± (0.0001)    0.4436 ± (0.0001)    0.8862 ± (0.0001)
Transformer     0.4216 ± (0.1112)    0.8650 ± (0.1981)    0.8309 ± (0.0203)
Enc-Dec (RNN)   0.1990 ± (0.0001)    0.4457 ± (0.0002)    0.8903 ± (0.0001)
STS-D           0.3110 ± (0.1050)    0.6711 ± (0.2051)    0.8673 ± (0.0205)

Share and Cite

MDPI and ACS Style

Peng, C.; Luo, L.; Luo, H.; Tang, Z. Study on Prediction of Zinc Grade by Transformer Model with De-Stationary Mechanism. Minerals 2024, 14, 230. https://doi.org/10.3390/min14030230

