#### 2.1.2. The Structure of the GRU Cell

The GRU is a refined version of the LSTM with a simpler structure [19]. The main difference between the GRU and the LSTM lies in how cell values are forgotten and updated. In the LSTM network, the update of cell values is controlled by two gates, the forget gate and the input gate; since two gate structures are required, the LSTM is relatively complex. The GRU instead controls both the forgetting coefficient and the update coefficient of the output with a single update gate, so it involves fewer matrix multiplications. Through this simplification, the GRU retains the functionality of the LSTM while reducing network training time. More specifically, the GRU consists of an update gate and a reset gate, which reduces the number of parameters by roughly one quarter compared with the LSTM. The reset gate determines how much previous memory is retained, and the update gate determines how much new information is combined with the previous memory. The structure of the GRU cell is shown in Figure 2.

**Figure 2.** Structure diagram of Gated Recurrent Unit (GRU) cell.

In contrast to the LSTM, the GRU has only two gates: the update gate is shown in the blue box in Figure 2 and the reset gate in the red box. The forward pass of the GRU is computed as follows.

$$\mathbf{r}\_t = \sigma\left(\mathbf{W}\_r \cdot \left[\mathbf{h}\_{t-1}, \mathbf{x}\_t\right]\right) \tag{7}$$

$$\mathbf{z}\_t = \sigma\left(\mathbf{W}\_z \cdot \left[\mathbf{h}\_{t-1}, \mathbf{x}\_t\right]\right) \tag{8}$$

$$\tilde{\mathbf{h}}\_{t} = \tanh\left(\mathbf{W}\_{\tilde{h}} \cdot [\mathbf{r}\_{t} \cdot \mathbf{h}\_{t-1}, \mathbf{x}\_{t}]\right) \tag{9}$$

$$\mathbf{h}\_{t} = (1 - \mathbf{z}\_{t}) \cdot \mathbf{h}\_{t-1} + \mathbf{z}\_{t} \cdot \mathbf{\tilde{h}}\_{t} \tag{10}$$

where $\mathbf{r}\_t$ is the reset gate, determining how much information in the previous cell state should be forgotten; $\mathbf{z}\_t$ is the update gate, determining how much information should be brought to the next cell; $\tilde{\mathbf{h}}\_t$ is the intermediate (candidate) state; and $\mathbf{h}\_t$ is the hidden state. For the update gate, a greater value of $\mathbf{z}\_t$ means that more new information is brought to the next cell. For the reset gate, a smaller value means that more information from the former cell is ignored [16].
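To make Equations (7)–(10) concrete, a minimal NumPy sketch of a single GRU forward step is given below; the function and variable names are illustrative, and bias terms are omitted, as in the equations above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell_forward(x_t, h_prev, W_r, W_z, W_h):
    """One forward step of a GRU cell, following Equations (7)-(10).

    x_t    : current input, shape (d,)
    h_prev : previous hidden state h_{t-1}, shape (d_h,)
    W_r, W_z, W_h : weight matrices, each of shape (d_h, d_h + d);
                    bias terms are omitted, as in Equations (7)-(9)
    """
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                     # reset gate, Eq. (7)
    z_t = sigmoid(W_z @ concat)                     # update gate, Eq. (8)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state, Eq. (9)
    return (1.0 - z_t) * h_prev + z_t * h_tilde     # hidden state h_t, Eq. (10)
```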

#### *2.2. Support Vector Data Description*

SVDD is a kernel method that maps data samples into a high-dimensional feature space through a non-linear mapping. In this feature space, a compact hypersphere with the minimum radius that covers the maximum number of data samples is obtained by solving an optimization problem. SVDD is widely used in anomaly detection: if a new sample is mapped inside the hypersphere, it is regarded as normal; otherwise, it is regarded as faulty.

Given a data set $\mathbf{x}\_i \in \mathbb{R}^d$, $i = 1, \cdots, N$, let $\mathbf{a} \in \mathbb{R}^d$ be the center of the hypersphere and $R$ its radius. The following objective function can then be formulated for SVDD.

$$\begin{cases} \min\limits\_{R,\, \mathbf{a},\, \xi\_i} F\left(R, \mathbf{a}, \xi\_i\right) = R^2 + C \sum\_{i=1}^{N} \xi\_i \\ \text{s.t.} \quad \left\|\mathbf{x}\_i - \mathbf{a}\right\|^2 \leq R^2 + \xi\_i \end{cases} \tag{11}$$

Here, $\xi\_i$ is the relaxation factor and $C$ is the penalty parameter. In Equation (11), $\xi\_i$ satisfies $\xi\_i \geq 0$, $\forall i$. The above optimization problem can be transformed as follows using Lagrange multipliers.

$$L\left(R, \mathbf{a}, \alpha\_i, \gamma\_i, \xi\_i\right) = R^2 + C \sum\_{i=1}^{N} \xi\_i - \sum\_{i=1}^{N} \gamma\_i \xi\_i - \sum\_{i=1}^{N} \alpha\_i \left[R^2 + \xi\_i - \left(\left\|\mathbf{x}\_i\right\|^2 - 2\,\mathbf{a} \cdot \mathbf{x}\_i + \left\|\mathbf{a}\right\|^2\right)\right] \tag{12}$$

where $\gamma\_i$ and $\alpha\_i$ are the Lagrange multipliers, satisfying $\alpha\_i \geq 0$ and $\gamma\_i \geq 0$. Differentiating Equation (12) with respect to $R$, $\mathbf{a}$, and $\xi\_i$ and setting the derivatives equal to zero, the following holds:

$$\begin{cases} \frac{\partial L}{\partial R} = 0 \\ \frac{\partial L}{\partial \mathbf{a}} = 0 \\ \frac{\partial L}{\partial \xi\_i} = 0 \end{cases} \implies \begin{cases} \sum\_{i=1}^{N} \alpha\_i = 1 \\ \mathbf{a} = \sum\_{i=1}^{N} \alpha\_i \mathbf{x}\_i \\ C - \alpha\_i - \gamma\_i = 0 \end{cases} \tag{13}$$

Substituting Equation (13) into Equation (12), one obtains:

$$L = \sum\_{i=1}^{N} \alpha\_i \left(\mathbf{x}\_i \cdot \mathbf{x}\_i\right) - \sum\_{i=1}^{N} \sum\_{j=1}^{N} \alpha\_i \alpha\_j \left(\mathbf{x}\_i \cdot \mathbf{x}\_j\right) \tag{14}$$

where $\alpha\_i$ are the multipliers of the support vectors and satisfy $0 \leq \alpha\_i \leq C$. Generally, a kernel function $K$ is used to check whether the squared distance between a new sample $\mathbf{y} \in \mathbb{R}^d$ and the center of the hypersphere is less than the squared radius $R^2$:

$$D^2\left(\mathbf{y}\right) = K\left(\mathbf{y} \cdot \mathbf{y}\right) - 2 \sum\_{i=1}^{N} \alpha\_i K\left(\mathbf{y} \cdot \mathbf{x}\_i\right) + \sum\_{i,j=1}^{N} \alpha\_i \alpha\_j K\left(\mathbf{x}\_i \cdot \mathbf{x}\_j\right) \leq R^2 \tag{15}$$

The kernel term $K\left(\mathbf{x}\_i \cdot \mathbf{x}\_j\right)$ is commonly used to replace the inner product $\mathbf{x}\_i \cdot \mathbf{x}\_j$; here, the Gaussian kernel is adopted:

$$K\left(\mathbf{x}\_{i}\cdot\mathbf{x}\_{j}\right) = \exp\left(-\frac{||\mathbf{x}\_{i} - \mathbf{x}\_{j}||^{2}}{\sigma^{2}}\right) \tag{16}$$
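Putting Equations (14)–(16) together, a small self-contained sketch of SVDD training and the distance computation might look as follows. The function names are hypothetical, and the use of SciPy's general-purpose SLSQP solver instead of a dedicated QP solver is an illustrative simplification.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel_matrix(A, B, sigma):
    """Pairwise Gaussian kernel of Equation (16) between the rows of A and B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-sq_dist / sigma**2)

def svdd_distance2(Y, X, alpha, sigma):
    """Squared kernel distance D^2(y) of Equation (15) for each row of Y."""
    K_yy = np.ones(Y.shape[0])              # K(y, y) = 1 for the Gaussian kernel
    K_yx = gaussian_kernel_matrix(Y, X, sigma)
    K_xx = gaussian_kernel_matrix(X, X, sigma)
    return K_yy - 2.0 * K_yx @ alpha + alpha @ K_xx @ alpha

def fit_svdd(X, C, sigma):
    """Solve the dual of Equation (14): maximize sum_i alpha_i K_ii -
    sum_ij alpha_i alpha_j K_ij subject to sum_i alpha_i = 1, 0 <= alpha_i <= C."""
    N = X.shape[0]
    K = gaussian_kernel_matrix(X, X, sigma)
    objective = lambda a: -(a @ np.diag(K) - a @ K @ a)   # negated for minimization
    constraint = {"type": "eq", "fun": lambda a: np.sum(a) - 1.0}
    alpha = minimize(objective, np.full(N, 1.0 / N),
                     bounds=[(0.0, C)] * N, constraints=[constraint]).x
    # R^2 equals D^2 evaluated at a boundary support vector (0 < alpha_i < C)
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    k = sv[0] if sv.size else int(np.argmax(alpha))
    R2 = svdd_distance2(X[k][None, :], X, alpha, sigma)[0]
    return alpha, R2
```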

#### **3. Fault Detection and Identification Strategy**

In order to detect and identify a process fault, it is essential to characterize the normal operating condition (NOC). Hence, a training dataset collected under normal operating conditions is used to construct the GRU neural network. The GRU neural network generates model residuals, which are further used to construct monitoring statistics using SVDD. As described earlier, the GRU model is capable of extracting the spatial and temporal signatures in the data that are important for characterizing the complex ironmaking process. The general framework for fault detection and identification based on GRU-SVDD is described in detail in the following subsections.

#### *3.1. Fault Detection*

In order to detect a process fault, a model must be trained on the NOC data. In the ironmaking process, this involves training a GRU on multiple time series to model the temporal dynamics and the correlations between process variables.

The GRU model is trained on historical normal data. Specifically, the GRU model uses the past information captured by its cell state together with the current observation to predict the next observation. Assume a training set $\mathbf{x}\_i \in \mathbb{R}^d$, $i = 1, \cdots, N$ is collected under NOC; a moving-window approach can then be applied with window length $n$, $n \ll N$ (a sketch of this windowing is given below). Taking the first window as an example, the structure of a two-layer GRU is shown in Figure 3.
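The moving-window construction referenced above can be sketched with a hypothetical helper; each window of $n$ past samples is paired with the observation that follows it as the prediction target.

```python
import numpy as np

def make_windows(X, n):
    """Slice an (N, d) series into sliding windows of length n, each paired
    with the observation that follows it as the prediction target."""
    inputs = np.stack([X[i:i + n] for i in range(len(X) - n)])  # (N - n, n, d)
    targets = X[n:]                                             # (N - n, d)
    return inputs, targets
```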

**Figure 3.** Structure of the two-layer GRU.

Here, $\mathbf{h}\_i \in \mathbb{R}^{d\_h}$ denotes the hidden state of the first layer at the $i$th time step, $\mathbf{h}'\_i \in \mathbb{R}^{d'\_h}$ denotes the hidden state of the second layer, and $\hat{\mathbf{x}}\_{n+1} \in \mathbb{R}^d$ is the predicted value. The hidden state $\mathbf{h}\_i$ of the first layer becomes the input to the second layer of the GRU model. The final output is then obtained using the dense layer as follows.

$$\hat{\mathbf{x}}\_{n+1} = \mathbf{W}^{\prime} \mathbf{h}\_n^{\prime} + \mathbf{b}^{\prime} \tag{17}$$

Here, $\hat{\mathbf{x}}\_{n+1}$ is the prediction of $\mathbf{x}$ at the $(n+1)$th time instance, $\mathbf{W}^{\prime}$ is the weight matrix of the dense layer, and $\mathbf{b}^{\prime}$ is the bias term. A GRU model with more layers could also be used; for the sake of simplicity, however, a two-layer GRU is considered here.
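As an illustration, such a two-layer GRU with a dense output layer could be assembled using, e.g., the Keras API; the layer sizes and training settings below are placeholders rather than the configuration used in this work.

```python
import tensorflow as tf

def build_gru_model(n, d, d_h):
    """Two-layer GRU followed by a dense output layer (Figure 3, Equation (17))."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n, d)),
        tf.keras.layers.GRU(d_h, return_sequences=True),  # first layer: h_1, ..., h_n
        tf.keras.layers.GRU(d_h),                         # second layer: returns h'_n
        tf.keras.layers.Dense(d),                         # x_hat_{n+1} = W' h'_n + b'
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```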

The model parameters can be trained based on the $N - n + 1$ windows. Once the model parameters are estimated, the model output $\hat{\mathbf{x}}\_{n+1}$ can be predicted from the past $n$ samples. A series of residuals can then be obtained as $\mathbf{e}\_i = \left|\hat{\mathbf{x}}\_i - \mathbf{x}\_i\right|$, $i = n+1, \cdots, N$. The residual series obtained from the GRU under NOC is fed into the SVDD to estimate its parameters, namely the center $\mathbf{a}$ and the radius $R$ of the hypersphere. Whenever a new sample is available, the residual obtained from the GRU is fed into the SVDD to calculate the squared distance $D^2$ according to Equation (15). If $D^2$ is greater than $R^2$, the sample is declared faulty; otherwise, it is normal.
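Combining the pieces, the overall GRU-SVDD training and detection logic of this subsection can be sketched as follows, reusing the hypothetical helpers from the earlier sketches.

```python
import numpy as np

# Reuses the hypothetical helpers sketched earlier:
# make_windows, build_gru_model, fit_svdd, svdd_distance2

def train_gru_svdd(X_noc, n, d_h, C, sigma):
    """Fit the GRU on NOC data, then fit the SVDD on the GRU residuals."""
    inputs, targets = make_windows(X_noc, n)
    model = build_gru_model(n, X_noc.shape[1], d_h)
    model.fit(inputs, targets, epochs=100, verbose=0)
    residuals = np.abs(model.predict(inputs, verbose=0) - targets)  # e_i = |x_hat_i - x_i|
    alpha, R2 = fit_svdd(residuals, C, sigma)
    return model, residuals, alpha, R2

def detect(model, window, x_new, residuals, alpha, R2, sigma):
    """Flag the new sample x_new as faulty if D^2 > R^2 (Equation (15))."""
    x_hat = model.predict(window[None, :, :], verbose=0)[0]   # predict from past n samples
    e_new = np.abs(x_hat - x_new)
    D2 = svdd_distance2(e_new[None, :], residuals, alpha, sigma)[0]
    return D2 > R2
```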
