*3.2. Feature Attention Mechanism*

Generally, the feature attention mechanism can improve the performance of the Bi-LSTM by dynamically assigning attention weights to the input features and by mining the correlation between the hidden-layer features and the target feature, which effectively reduces the loss of feature correlations. The architecture of the feature attention mechanism is shown in Figure 2.

**Figure 2.** Architecture of feature attention mechanism.

From Figure 2, the input feature vector of the time sequence with *K* hidden-layer features can be written as *Xt* = [*x*1,*t*, *x*2,*t*, ..., *xK*,*t*]. A single-layer neural network is then used to calculate the attention weight vector, which can be expressed as:

$$e\_t = \sigma(W\_e X\_t + b\_e) \tag{6}$$

where *t* is the time step of the input sequence, whose length depends on the sampling rate, and *et* = [*e*1,*t*, *e*2,*t*, ..., *eK*,*t*] is the vector of attention weight coefficients corresponding to the input features at the current moment. *We* is a trainable weight matrix, *be* is an offset vector, and *σ*(·) is the sigmoid activation function.
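As a minimal sketch of Eq. (6) (the paper provides no reference implementation, so the function name, shapes, and NumPy usage below are our own assumptions), this step is a single dense layer followed by a sigmoid:

```python
import numpy as np

def attention_scores(x_t, W_e, b_e):
    """Eq. (6): score the K input features at time step t.

    x_t : (K,) input feature vector X_t
    W_e : (K, K) trainable weight matrix
    b_e : (K,) offset vector
    Returns e_t, the (K,) vector of attention weight coefficients.
    """
    return 1.0 / (1.0 + np.exp(-(W_e @ x_t + b_e)))  # sigmoid(W_e X_t + b_e)
```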

The scores generated by the sigmoid activation function are then normalized by the softmax function, which is denoted as:

$$\alpha\_{k,t} = \frac{\exp(e\_{k,t})}{\sum\_{i=1}^{K} \exp(e\_{i,t})} \tag{7}$$

where *αk*,*t* is the attention weight of feature *k*. The resulting attention weight vector *αt* and the feature vector *Xt* are then combined into the weighted feature vector *X<sup>a</sup><sub>t</sub>*, which can be expressed as:

$$X\_t^a = \alpha\_t \odot X\_t = \begin{bmatrix} \alpha\_{1,t} x\_{1,t}, \alpha\_{2,t} x\_{2,t}, \cdots, \alpha\_{K,t} x\_{K,t} \end{bmatrix} \tag{8}$$
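Continuing the sketch above (again with our own assumed names; the max-subtraction inside the softmax is a standard numerical-stability trick, not part of the paper's formulation), Eqs. (7) and (8) normalize the scores and reweight the features:

```python
import numpy as np

def feature_attention(x_t, e_t):
    """Eqs. (7)-(8): softmax-normalize the scores, then reweight the features.

    x_t : (K,) input feature vector X_t
    e_t : (K,) attention weight coefficients from Eq. (6)
    Returns the weighted feature vector X_t^a.
    """
    exp_e = np.exp(e_t - e_t.max())  # shift by the max for numerical stability
    alpha_t = exp_e / exp_e.sum()    # Eq. (7): softmax attention weights
    return alpha_t * x_t             # Eq. (8): element-wise product alpha_t ⊙ X_t
```

Because the weights are applied as an element-wise product, the feature dimension is unchanged, so this block can sit in front of the Bi-LSTM input without altering its shape.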
