*3.2. LSTM-CRF Model*

LSTM networks are a specific form of recurrent neural network (RNN) proposed by Sepp Hochreiter and Jürgen Schmidhuber [22]. Unlike a standard RNN, conventional neurons are replaced with memory cells, each of which contains an input gate, a forget gate, and an output gate. This design not only allows long-term dependencies to be captured but also mitigates the vanishing and exploding gradient problems. In this article, the LSTM-CRF model follows the one proposed by Huang et al. [16]. The model can be formulated as follows:

$$\mathbf{i}\_t = \sigma\left(\mathbf{W}\_{xi}\mathbf{x}\_t + \mathbf{W}\_{hi}\mathbf{h}\_{t-1} + \mathbf{W}\_{ci}\mathbf{c}\_{t-1} + \mathbf{b}\_i\right) \tag{1}$$

$$\mathbf{f}\_t = \sigma\left(\mathbf{W}\_{xf}\mathbf{x}\_t + \mathbf{W}\_{hf}\mathbf{h}\_{t-1} + \mathbf{W}\_{cf}\mathbf{c}\_{t-1} + \mathbf{b}\_f\right) \tag{2}$$

$$\tilde{\mathbf{c}}\_t = \tanh\left(\mathbf{W}\_{xc}\mathbf{x}\_t + \mathbf{W}\_{hc}\mathbf{h}\_{t-1} + \mathbf{b}\_c\right) \tag{3}$$

$$\mathbf{c}\_t = \mathbf{f}\_t \odot \mathbf{c}\_{t-1} + \mathbf{i}\_t \odot \tilde{\mathbf{c}}\_t \tag{4}$$

$$\mathbf{o}\_t = \sigma\left(\mathbf{W}\_{xo}\mathbf{x}\_t + \mathbf{W}\_{ho}\mathbf{h}\_{t-1} + \mathbf{W}\_{co}\mathbf{c}\_t + \mathbf{b}\_o\right) \tag{5}$$

$$\mathbf{h}\_t = \mathbf{o}\_t \odot \tanh(\mathbf{c}\_t) \tag{6}$$

where **f**, **i**, and **o** represent the forget, input, and output gates, respectively; *c<sub>t−1</sub>* is the state of the cell at time *t* − 1 and *c<sub>t</sub>* is the state of the cell at time *t*; *h<sub>t</sub>* is the output of the current state and *h<sub>t−1</sub>* is the output of the unit at the previous time step; σ is the logistic sigmoid function; **W** and **b** denote the weights and biases, respectively; and ⊙ is the element-wise product.
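
For concreteness, the following NumPy sketch steps a single memory cell through Equations (1)–(6); the weight and bias names are illustrative, and the peephole weights **W**<sub>ci</sub>, **W**<sub>cf</sub>, and **W**<sub>co</sub> are treated as ordinary matrices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM memory-cell step following Equations (1)-(6).
    W and b are dictionaries holding the (illustratively named) weights/biases."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])  # (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])  # (2)
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])                 # (3)
    c_t = f_t * c_prev + i_t * c_tilde       # (4): element-wise products
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c_t + b["o"])     # (5)
    h_t = o_t * np.tanh(c_t)                 # (6)
    return h_t, c_t
```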

The CRF model is an undirected graphical model proposed by Lafferty et al. [23] in 2001. The model has clear advantages in labeling and segmenting sequential data. It directly models the conditional distribution of the data and can effectively avoid the label bias problem experienced by other discriminative models. Moreover, unlike generative models, the CRF model does not rely on an independence assumption, which allows it to fuse a variety of complex, non-local features and better capture the potential relationships among states.

The goal of the CRF model is to calculate the conditional probability distribution of the optimal output sequence (prediction sequence) given the input sequence (observation sequence), i.e., *P*(*Y*|*X*). For example, assume that the random variable *X* = (*x*<sub>1</sub>, *x*<sub>2</sub>, ... , *x*<sub>*n*−1</sub>, *x<sub>n</sub>*) is the input sequence and the random variable *Y* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ... , *y*<sub>*n*−1</sub>, *y<sub>n</sub>*) is the output sequence. Then, given the observation sequence *X*, the conditional probability distribution of the prediction sequence *Y* is as shown in Formulas (7) and (8):

$$P(Y|X) = \frac{1}{Z(X)} \exp\left(\sum\_{i,k} \lambda\_k \alpha\_k (y\_{i-1}, y\_i, X, i) + \sum\_{i,k} \mu\_k \beta\_k (y\_i, X, i)\right) \tag{7}$$

$$Z(X) = \sum\_{y} \exp\left(\sum\_{i,k} \lambda\_k \alpha\_k (y\_{i-1}, y\_i, X, i) + \sum\_{i,k} \mu\_k \beta\_k (y\_i, X, i)\right) \tag{8}$$

where *α<sub>k</sub>* and *β<sub>k</sub>* are binary feature functions (taking values 0 or 1); *α<sub>k</sub>* is a transition feature function that reflects the correlation between the label variables at the adjacent positions *i* − 1 and *i* and the effect of the observation sequence on them; *β<sub>k</sub>* is a state feature function that represents the effect of the observation sequence at position *i* on the label variable; *λ<sub>k</sub>* and *μ<sub>k</sub>* are the weights corresponding to the feature functions *α<sub>k</sub>* and *β<sub>k</sub>*, respectively; and *Z*(*X*) is the normalization factor.
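
As a small illustration of Formulas (7) and (8), the sketch below evaluates *P*(*Y*|*X*) by brute-force enumeration of all label sequences; the feature functions and weights are hypothetical, and the exponential enumeration is only feasible for toy inputs:

```python
import itertools
import math

def crf_prob(y, x, labels, trans_feats, state_feats, lam, mu):
    """Brute-force evaluation of Formulas (7) and (8) for a short sequence.
    trans_feats[k](y_prev, y_cur, x, i) and state_feats[k](y_cur, x, i) are
    binary feature functions; lam and mu are their corresponding weights."""
    def unnorm(seq):
        s = 0.0
        for i in range(len(x)):
            if i > 0:  # transition features need a previous label
                s += sum(l * a(seq[i - 1], seq[i], x, i) for l, a in zip(lam, trans_feats))
            s += sum(m * b(seq[i], x, i) for m, b in zip(mu, state_feats))
        return math.exp(s)

    Z = sum(unnorm(seq) for seq in itertools.product(labels, repeat=len(x)))  # (8)
    return unnorm(tuple(y)) / Z                                               # (7)
```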

The LSTM-CRF model is widely used in NER tasks. It exploits the powerful data-fitting ability of the LSTM model, while sequence labeling is performed directly by the CRF, so the label of each word depends on the label predicted at the previous step. The LSTM-CRF model consists of a word embedding layer, a bidirectional LSTM layer, and a CRF layer. The word embedding layer uses pretrained embeddings to map each word *w<sub>i</sub>* of a sentence into a *d*-dimensional word vector. The LSTM-CRF model architecture is shown in Figure 2.

**Figure 2.** LSTM-CRF model architecture.


The word vectors are input into the bidirectional LSTM layer, which produces the hidden states of the forward LSTM and the hidden states of the backward LSTM. At each position, the two hidden states are concatenated to obtain a complete hidden state sequence $\mathbf{h}\_t = [\overrightarrow{\mathbf{h}\_t}; \overleftarrow{\mathbf{h}\_t}]$.
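
A minimal PyTorch sketch of the embedding and bidirectional LSTM layers is given below; the vocabulary size, embedding dimension *d* = 100, and hidden size of 128 are illustrative, and PyTorch returns the forward and backward hidden states already concatenated per position:

```python
import torch
import torch.nn as nn

# Illustrative sizes: vocabulary of 5000 words, d = 100, hidden size 128.
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=100)
bilstm = nn.LSTM(input_size=100, hidden_size=128,
                 bidirectional=True, batch_first=True)

sentence = torch.tensor([[12, 407, 3, 88]])   # one sentence of word indices
word_vecs = embedding(sentence)               # shape: (1, 4, 100)
h, _ = bilstm(word_vecs)                      # shape: (1, 4, 256)
# h[:, t, :128] is the forward hidden state at position t,
# h[:, t, 128:] is the backward hidden state, i.e., h_t = [forward; backward].
```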

Next, *P* is regarded as the *n* × *k* matrix of scores output by the LSTM network, where *k* is the number of tag categories. *P<sub>i,j</sub>* corresponds to the score of the *j*-th tag for the *i*-th word in a sentence. Given an input sentence *X* = (*x*<sub>1</sub>, *x*<sub>2</sub>, ... , *x<sub>n</sub>*) and a sequence of predictions *y* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ... , *y<sub>n</sub>*), the associated score is defined as follows:

$$\text{score}(X, y) = \sum\_{i=1}^{n} P\_{i, y\_i} + \sum\_{i=0}^{n} A\_{y\_i, y\_{i+1}}, \qquad P \in \mathbb{R}^{n \times k},\ A \in \mathbb{R}^{(k+2) \times (k+2)} \tag{9}$$

where *A<sub>i,j</sub>* is the transition score from the *i*-th tag to the *j*-th tag. Start and end tags, denoted *y*<sub>0</sub> and *y*<sub>*n*+1</sub>, are added to the set of possible tags to mark the start and end of a sentence; thus, *A* is a square matrix of size *k* + 2. After applying a softmax function over all possible tag sequences, the probability of the sequence *y* is as follows:

$$P(y|X) = \frac{\exp(\text{score}(X, y))}{\sum\_{y'} \exp(\text{score}(X, y'))} \tag{10}$$
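
To make Equation (9) concrete, the following sketch (with hypothetical index conventions for the explicit start and end tags) computes the score of one tag path; the probability in Equation (10) would then divide exp(score) by the sum over all paths:

```python
def path_score(P, A, y, start, end):
    """Score of tag sequence y under Equation (9).
    P: (n, k) emission scores from the BiLSTM;
    A: (k+2, k+2) transition scores including the start/end tags;
    start, end: the row/column indices of the start and end tags in A."""
    tags = [start] + list(y) + [end]
    emit = sum(P[i, y[i]] for i in range(len(y)))                       # sum of P_{i, y_i}
    trans = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))  # sum of A_{y_i, y_{i+1}}
    return emit + trans
```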

The model is trained by maximizing the log likelihood function:

$$\log P(y|X) = \operatorname{score}(X, y) - \log\left(\sum\_{y'} \exp(\operatorname{score}(X, y'))\right) \tag{11}$$
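
In practice, the normalizer in Equation (11) is computed with the forward algorithm in log space rather than by enumerating all tag sequences; a minimal sketch (omitting the start and end tags for brevity) is shown below:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(P, A):
    """log of sum over y' of exp(score(X, y')), via the forward algorithm.
    P: (n, k) emission scores; A: (k, k) transition scores
    (start/end tags omitted for brevity)."""
    n, k = P.shape
    alpha = P[0].copy()               # log-scores of all length-1 prefixes
    for i in range(1, n):
        # alpha[j] = logsumexp over prev of (alpha[prev] + A[prev, j]) + P[i, j]
        alpha = logsumexp(alpha[:, None] + A, axis=0) + P[i]
    return logsumexp(alpha)

def log_likelihood(P, A, y):
    """Equation (11): log P(y|X) = score(X, y) - log Z(X)."""
    score = sum(P[i, y[i]] for i in range(len(y))) \
          + sum(A[y[i - 1], y[i]] for i in range(1, len(y)))
    return score - log_partition(P, A)
```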

Finally, the Viterbi algorithm is used to find the optimal path:

$$y^\* = \arg\max\_{y'} \operatorname{score}(X, y') \tag{12}$$
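
A minimal Viterbi decoder for Equation (12) is sketched below; as above, emission scores *P* and transition scores *A* are assumed as inputs, and the start/end tags are omitted for brevity:

```python
import numpy as np

def viterbi(P, A):
    """Equation (12): the highest-scoring tag path under emission
    scores P (n, k) and transition scores A (k, k)."""
    n, k = P.shape
    delta = P[0].copy()                   # best score of a path ending in each tag
    back = np.zeros((n, k), dtype=int)    # backpointers
    for i in range(1, n):
        scores = delta[:, None] + A       # scores[prev, cur]
        back[i] = scores.argmax(axis=0)   # best previous tag for each current tag
        delta = scores.max(axis=0) + P[i]
    path = [int(delta.argmax())]          # best final tag
    for i in range(n - 1, 0, -1):         # trace the backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```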
