### 2.1. GRU Neural Network

GRU is a type of recurrent neural network (RNN). The main difference between an RNN and a feed-forward artificial neural network lies in their structure. In a feed-forward network, signals travel from the inputs to the outputs and the flow of information is in the forward direction only; since there is no backward/feedback flow, the name "feed-forward" is justified. In contrast, an RNN feeds its output back to its input, hence the name "recurrent". In addition, the output of the previous time step in an RNN is used as part of the input at the next time step, unlike a feed-forward network, which considers only a fixed-length input and a fixed-length output. With this recurrent structure, an RNN can learn the characteristics of a time series and make predictions. A widely used RNN is the LSTM network, which is well suited to capturing long-term dependencies and is also able to avoid the vanishing gradient problem. As an improvement of LSTM, the GRU network inherits these advantages while having a simplified structure and fewer parameters, resulting in a lower computational load and better generalization ability.
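The structural difference described above can be illustrated with a minimal sketch: a feed-forward layer maps each input independently, while a recurrent layer threads a hidden state through time. The dimensions and random weights below are assumptions for illustration only, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed for this sketch).
n_in, n_hidden = 3, 4

# Feed-forward layer: the output depends on the current input only.
W_ff = rng.standard_normal((n_hidden, n_in))

# Recurrent layer: the output also depends on the previous hidden state.
W_xh = rng.standard_normal((n_hidden, n_in))
W_hh = rng.standard_normal((n_hidden, n_hidden))

def feed_forward(x):
    # No memory: the same input always yields the same output.
    return np.tanh(W_ff @ x)

def rnn_step(x, h_prev):
    # The hidden state h carries information across time steps.
    return np.tanh(W_xh @ x + W_hh @ h_prev)

# Process a short time series: the RNN threads its state through time.
xs = rng.standard_normal((5, n_in))
h = np.zeros(n_hidden)
for x in xs:
    h = rnn_step(x, h)   # h_t depends on x_t and h_{t-1}

y_ff = feed_forward(xs[0])  # same input, but no memory of the past
```

Note that `rnn_step` produces different outputs for the same input depending on the previous hidden state, which is precisely what allows the network to learn temporal characteristics.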

#### 2.1.1. The Structure of LSTM Cell

LSTM [17] was originally proposed in 1997 to solve the vanishing gradient problem faced by RNNs [18]. The main difference between LSTM and a standard RNN is the handling of long-term dependencies. In a standard RNN, each update involves only the last state and the current input; because each prediction depends only on the state at the previous moment, the RNN can only establish dependencies between states over short time spans. In contrast, LSTM can establish dependencies between states at arbitrarily long intervals, hence the name "Long Short-Term Memory" network. In addition, the LSTM updates its cell state through a conveyor-belt-like structure: the old cell state remains on the conveyor belt until it needs to be forgotten by a structure called a "gate". Through this conveyor-belt structure, LSTM can take long-term memory from the conveyor belt at any time to learn the characteristics of a time series and make predictions. An LSTM unit consists of a cell, an input gate, a forget gate, and an output gate. The cell records state values over different time intervals, and the three gates control the flow of information; their introduction enables the LSTM to keep, utilize, or discard a state when necessary.

Let $\mathbf{x}_t$ denote a data sample at the $t$-th time instance, and let $\mathbf{C}_{t-1}$ denote the cell value and $\mathbf{h}_{t-1}$ the hidden state of each cell at the $(t-1)$-th time instance. The information of previous time steps is stored in $\mathbf{C}_{t-1}$ and $\mathbf{h}_{t-1}$. The input gate regulates to what extent a new value $\mathbf{x}_t$ is transferred into the cell, the forget gate controls to what extent $\mathbf{C}_{t-1}$ remains in the cell, and the output gate regulates to what extent the cell value is used to calculate the output activation. The structure of a standard LSTM cell is shown in Figure 1.

**Figure 1.** Structure diagram of LSTM cell.

In Figure 1, the green box, blue box, and red box correspond to the input gate, the output gate, and the forget gate, respectively. The mathematical formulation of the forget gate is:

$$\mathbf{f}\_t = \sigma \left( \mathbf{W}\_f \cdot [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_f \right) \tag{1}$$

where $\mathbf{W}_f$ is the weight matrix of the forget gate; $\sigma$ is the sigmoid activation function; $\mathbf{b}_f$ is the bias vector of the forget gate; and $[\mathbf{h}_{t-1}, \mathbf{x}_t]$ is the vector formed by concatenating the previous hidden state vector $\mathbf{h}_{t-1}$ and the input vector $\mathbf{x}_t$ at the current moment. The input gate decides what new information will be saved in the cell value. Mathematically, the input gate can be described as follows.

$$\mathbf{i}\_t = \sigma\left(\mathbf{W}\_i \cdot [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_i\right) \tag{2}$$

$$\tilde{\mathbf{C}}\_t = \tanh\left(\mathbf{W}\_C \cdot [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_C\right) \tag{3}$$

Here, $\mathbf{W}_i$ and $\mathbf{b}_i$ are the weight matrix and bias vector of the input gate, while $\mathbf{W}_C$ and $\mathbf{b}_C$ produce the candidate cell value. The input gate adds new information generated by the current input to the cell value, creating the new memories $\mathbf{i}_t$ and $\tilde{\mathbf{C}}_t$. The current cell value $\mathbf{C}_t$ is then updated from the previous cell value $\mathbf{C}_{t-1}$ and the new memories $\mathbf{i}_t$ and $\tilde{\mathbf{C}}_t$ as follows.

$$\mathbf{C}\_t = \mathbf{f}\_t \cdot \mathbf{C}\_{t-1} + \mathbf{i}\_t \cdot \tilde{\mathbf{C}}\_t \tag{4}$$

Finally, the hidden state $\mathbf{h}_t$ is updated in the output gate as:

$$\mathbf{o}\_t = \sigma\left(\mathbf{W}\_o \cdot [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_o\right) \tag{5}$$

$$\mathbf{h}\_t = \mathbf{o}\_t \cdot \tanh\left(\mathbf{C}\_t\right) \tag{6}$$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the weight matrix and bias vector of the output gate. In this way, the cell value $\mathbf{C}_t$ and hidden state $\mathbf{h}_t$ can be updated whenever a new sample $\mathbf{x}_t$ becomes available.
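Equations (1)–(6) can be sketched as a single NumPy forward step. The toy dimensions and random weights below are assumptions for illustration; this is a minimal sketch of one cell update, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM cell update following Equations (1)-(6)."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)           # (2) input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # (3) candidate cell value
    C_t = f_t * C_prev + i_t * C_tilde     # (4) cell value update
    o_t = sigmoid(W_o @ z + b_o)           # (5) output gate
    h_t = o_t * np.tanh(C_t)               # (6) hidden state update
    return h_t, C_t

# Toy dimensions (assumed): 3 inputs, 4 hidden units.
rng = np.random.default_rng(1)
n_in, n_h = 3, 4

# Each gate has one weight matrix acting on [h_{t-1}, x_t] and one bias.
params = []
for _ in range(4):
    params += [rng.standard_normal((n_h, n_h + n_in)) * 0.1, np.zeros(n_h)]
params = tuple(params)

# Run the cell over a short time series.
h, C = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((6, n_in)):
    h, C = lstm_step(x, h, C, params)
```

Note how equation (4) realizes the "conveyor belt": the forget gate $\mathbf{f}_t$ scales the old cell value while the input gate $\mathbf{i}_t$ admits the new candidate, so information can persist across many steps.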
