*3.2. BLSTM Layer*

Long Short-Term Memory (LSTM) [3] was developed on the basis of the Recurrent Neural Network (RNN) to address the problems of vanishing and exploding gradients. Its main idea is to add "gates" to the recurrent architecture in order to control the flow of information. A common LSTM unit consists of a memory cell, an input gate, an output gate and a forget gate. The LSTM is organized as a chain of repeating neural network modules. The memory cell, which stores information, runs along the whole chain, while the three gates control whether information is added to or blocked from the memory cell.

Given the previous hidden state $h_{t-1}$ and the input $x_t$ at the current time step, the gates determine how the memory cell and the current hidden state $h_t$ are updated. The forget gate outputs the proportion of the old cell state to be kept, while the input gate controls how much new information is added to the memory cell. This addition consists of three steps. First, a sigmoid layer regulates which information is to be written to the memory cell. Second, a $\tanh$ layer computes a candidate vector from $h_{t-1}$ and $x_t$. Finally, these two values are multiplied elementwise and added to the memory cell. The output gate selects the useful information from the memory cell to output: it first applies the $\tanh$ function to the cell state to obtain a vector, then regulates the information from $h_{t-1}$ and $x_t$ with a sigmoid layer and multiplies the result by that vector. In this way, the output at the current time step is obtained.

The LSTM transition functions are defined as follows:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \tag{5}$$

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \tag{6}$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \tag{7}$$

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \tag{8}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{9}$$

$$h_t = o_t \odot \tanh(c_t). \tag{10}$$

Here $\sigma$ refers to the logistic sigmoid function, whose output lies in [0, 1], $\tanh$ denotes the hyperbolic tangent function, whose output lies in [−1, 1], and $\odot$ denotes elementwise multiplication. At the current time step $t$, $h_t$ is the hidden state, $f_t$ the forget gate, $i_t$ the input gate, and $o_t$ the output gate. $W_i$, $W_o$, and $W_f$ are the weight matrices of the three gates, and $b_i$, $b_o$, $b_f$ are their biases; $W_c$ and $b_c$ parameterize the candidate cell state $\tilde{c}_t$.
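For concreteness, the following minimal NumPy sketch implements one step of Eqs. (5)–(10). The function name `lstm_step` and the tensor shapes are illustrative assumptions, not part of the original model description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
    """One LSTM step following Eqs. (5)-(10).

    x_t:    input vector at time t, shape (d_x,)
    h_prev: previous hidden state h_{t-1}, shape (d_h,)
    c_prev: previous cell state c_{t-1}, shape (d_h,)
    W_*:    weight matrices, shape (d_h, d_h + d_x)
    b_*:    bias vectors, shape (d_h,)
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ z + b_i)           # input gate, Eq. (5)
    f_t = sigmoid(W_f @ z + b_f)           # forget gate, Eq. (6)
    o_t = sigmoid(W_o @ z + b_o)           # output gate, Eq. (7)
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state, Eq. (8)
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state, Eq. (9)
    h_t = o_t * np.tanh(c_t)               # new hidden state, Eq. (10)
    return h_t, c_t
```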

BLSTM is an extension of the unidirectional LSTM: it adds a second hidden layer that processes the sequence in the opposite temporal order alongside the first one. Owing to this structure, a BLSTM can exploit information from both the past and the future. We therefore adopt a BLSTM to capture the information of the text input in this paper.
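As an illustration only, such a bidirectional layer can be instantiated with an off-the-shelf implementation such as PyTorch's `nn.LSTM`; the embedding and hidden dimensions below are hypothetical and do not come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 300-dim word vectors, 128 hidden units per direction.
embed_dim, hidden_dim = 300, 128

blstm = nn.LSTM(input_size=embed_dim,
                hidden_size=hidden_dim,
                bidirectional=True,   # forward + backward hidden layers
                batch_first=True)

# A toy batch: 2 sentences of 10 tokens each, already embedded.
x = torch.randn(2, 10, embed_dim)
outputs, (h_n, c_n) = blstm(x)

# At each time step the forward and backward hidden states are concatenated,
# so the per-token feature size is 2 * hidden_dim.
print(outputs.shape)  # torch.Size([2, 10, 256])
```

The concatenated forward and backward states give each token a representation that depends on its full left and right context, which is why the BLSTM output is used as the text representation here.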
