*2.6. Basic Principle of LSTM*

RNNs were first proposed in the 1980s. As a popular algorithm in deep learning, the recurrent network structure of an RNN, in contrast to a feed-forward deep neural network (DNN), allows it to take full advantage of the sequential information contained in sequence data, so it has many advantages in dealing with time series. Moreover, its errors can be corrected through back-propagation (through time) combined with a gradient descent algorithm. However, there are also problems: researchers have found that RNNs are weak on long time series, meaning that their long-term memory is poor. At the same time, as the sequence length increases, the depth of the unrolled model grows, and the vanishing and exploding gradient problems cannot be avoided when calculating the gradients. Therefore, Hochreiter et al. [69] proposed LSTM. The structure of LSTM is shown in Figure 6 [70].

**Figure 6.** Long short-term memory network topology diagram.

The long short-term memory (LSTM) network differs from the traditional recurrent neural network, which rewrites its memory at every time step. An LSTM saves the important features it has learned as long-term memory and selectively retains, updates, or forgets that stored memory as learning proceeds, while features that receive small weights over many iterations are treated as short-term memory and are eventually forgotten by the network. This mechanism allows important feature information to be transmitted across iterations, so the network performs better on classification tasks with long-range dependence between samples. LSTM has been widely applied in flood susceptibility prediction [71], the prediction of key parameters of nuclear power plants [72], wind speed prediction [73,74], financial price trends [75], language processing [76], etc. The LSTM model makes a series of improvements on the basic RNN neuron, the key one being a cell state added to the RNN hidden layer and controlled by three gating units: the forget gate, the input gate, and the output gate. The forget gate controls which information is forgotten and to what extent it is retained. Its calculation formula is:

$$F_t = \sigma\left(W_F \cdot [h_{t-1}, x_t] + b_F\right) \tag{6}$$

where $x_t$ is the current input information, $h_{t-1}$ is the information in the previous hidden state, $\sigma$ is the sigmoid activation function, $W_F$ is the weight matrix, $b_F$ is the offset term, and $F_t$ ranges from 0 to 1. When $F_t = 1$, the information is completely retained; when $F_t = 0$, it is completely discarded.
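As a minimal sketch of Equation (6), the forget gate can be computed for a single time step with NumPy. The sizes, the random weights, and the helper name `sigmoid` are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes values into the (0, 1) range required for F_t
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes (assumptions): 4 input features, 3 hidden units
n_x, n_h = 4, 3
rng = np.random.default_rng(0)

W_F = rng.standard_normal((n_h, n_h + n_x))  # forget-gate weight matrix W_F
b_F = np.zeros(n_h)                          # forget-gate offset term b_F

x_t = rng.standard_normal(n_x)               # current input x_t
h_prev = rng.standard_normal(n_h)            # previous hidden state h_{t-1}

# Equation (6): F_t = sigma(W_F . [h_{t-1}, x_t] + b_F)
F_t = sigmoid(W_F @ np.concatenate([h_prev, x_t]) + b_F)
print(F_t)  # each entry lies in (0, 1): 1 keeps the stored memory, 0 discards it
```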

The input gate is used to control how much of the input information at the current time step is saved to the cell state. The expression is written as:

$$I_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{7}$$

where $W_i$ is the weight matrix, $b_i$ is the offset term, and $I_t$ is the input gate vector. The cell state $C_t$ is then updated from the candidate state $\widetilde{C}_t$ as:

$$C_t = F_t \odot C_{t-1} + I_t \odot \widetilde{C}_t \tag{8}$$

$$\widetilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{9}$$

where $\widetilde{C}_t$ is the candidate cell state, $W_C$ is the weight matrix, and $b_C$ is the offset term.
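Continuing the sketch, Equations (7)–(9) translate directly into NumPy; the shapes, the random initialisation, and the stand-in forget gate `F_t` are again assumptions made only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 4, 3                                   # illustrative sizes
rng = np.random.default_rng(0)

W_i, b_i = rng.standard_normal((n_h, n_h + n_x)), np.zeros(n_h)  # input gate
W_C, b_C = rng.standard_normal((n_h, n_h + n_x)), np.zeros(n_h)  # candidate state

x_t = rng.standard_normal(n_x)                    # current input x_t
h_prev = rng.standard_normal(n_h)                 # previous hidden state h_{t-1}
C_prev = rng.standard_normal(n_h)                 # previous cell state C_{t-1}
F_t = sigmoid(rng.standard_normal(n_h))           # stand-in for Equation (6)

z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
I_t = sigmoid(W_i @ z + b_i)                      # Equation (7): input gate
C_tilde = np.tanh(W_C @ z + b_C)                  # Equation (9): candidate cell state
C_t = F_t * C_prev + I_t * C_tilde                # Equation (8): Hadamard-product update
```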

The output gate $O_t$ is calculated as:

$$O_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{10}$$

where $b_o$ is the offset term, $W_o$ is the weight matrix, and $h_{t-1}$ is the hidden layer state at time $t-1$.

$$h_t = O_t \odot \tanh\left(C_t\right) \tag{11}$$

In Equation (11), $\odot$ denotes the Hadamard (element-wise) product and $h_t$ is the hidden layer state at time $t$.
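Putting Equations (6)–(11) together, one complete forward step of an LSTM cell can be sketched as below. The function name `lstm_step`, the parameter dictionary, and the toy shapes are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One forward step of an LSTM cell, following Equations (6)-(11)."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    F_t = sigmoid(p["W_F"] @ z + p["b_F"])       # (6) forget gate
    I_t = sigmoid(p["W_i"] @ z + p["b_i"])       # (7) input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # (9) candidate cell state
    C_t = F_t * C_prev + I_t * C_tilde           # (8) cell-state update
    O_t = sigmoid(p["W_o"] @ z + p["b_o"])       # (10) output gate
    h_t = O_t * np.tanh(C_t)                     # (11) new hidden state
    return h_t, C_t

# Illustrative sizes: 4 input features, 3 hidden units
n_x, n_h = 4, 3
rng = np.random.default_rng(0)
p = {w: rng.standard_normal((n_h, n_h + n_x)) for w in ("W_F", "W_i", "W_C", "W_o")}
p.update({b: np.zeros(n_h) for b in ("b_F", "b_i", "b_C", "b_o")})

# Run the cell over a toy sequence of 5 time steps
h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((5, n_x)):
    h, C = lstm_step(x_t, h, C, p)
print(h)
```

Because the cell state $C_t$ is carried forward additively rather than being rewritten at every step, important information can flow across many time steps, which is exactly the property that mitigates the gradient problems of the plain RNN described above.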
