## *3.3. Convolutional Neural Network Layer*

The input has two dimensions: $x_i \in \mathbb{R}^d$ represents the $d$-dimensional vector for the $i$-th word in the sentence, and $x \in \mathbb{R}^{L \times d}$ denotes the input sentence, where $L$ is the length of the sentence. One-dimensional convolution is employed to extract features from the output of the LSTM layer.

The one-dimensional convolution slides a filter over the input sequence to detect features at different positions. The word vectors inside a sliding window of width $k$ are denoted $x_j, x_{j+1}, \cdots, x_{j+k-1}$. The window vector can be represented as follows:

$$\mathbf{w}_j = [\mathbf{x}_j, \mathbf{x}_{j+1}, \cdots, \mathbf{x}_{j+k-1}]. \tag{11}$$

The window vectors associated with the word $x_j$ are $w_{j-k+1}, w_{j-k+2}, \cdots, w_j$. For each window vector $w_j$, its feature map can be computed as follows:

$$c_j = f(\mathbf{w}_j \circ \mathbf{m} + b), \tag{12}$$

where $\circ$ denotes the dot product, $\mathbf{m}$ is the weight vector of the filter, $b \in \mathbb{R}$ is a bias term, and $f$ is a nonlinear transformation function such as the sigmoid or hyperbolic tangent. In our experiment, ReLU is chosen as the nonlinear function. Our model adopts $n$ filters to produce the feature maps as follows:

$$\mathbf{W} = [\mathbf{c}_1, \mathbf{c}_2, \cdots, \mathbf{c}_n]. \tag{13}$$

In the formula above, $c_i$ refers to the feature map generated by the $i$-th filter. The convolution layer may have multiple filters of the same size to learn complementary features, or multiple kinds of filters with different sizes.
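To make Equations (11)–(13) concrete, the following NumPy sketch slides a width-$k$ window over the word vectors and applies ReLU, the nonlinearity chosen in our experiment. The filter shapes and the toy random sentence are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv1d_feature_maps(X, filters, biases):
    """Compute feature maps per Equations (11)-(13).

    X       : (L, d) matrix of word vectors x_1..x_L
    filters : (n, k*d) matrix; row i is the weight vector m of the i-th filter
    biases  : (n,) bias terms b
    Returns : (n, L-k+1) matrix W whose rows are the feature maps c_1..c_n
    """
    L, d = X.shape
    n, kd = filters.shape
    k = kd // d                       # filter window width
    maps = np.empty((n, L - k + 1))
    for j in range(L - k + 1):
        w_j = X[j:j + k].ravel()      # window vector w_j = [x_j, ..., x_{j+k-1}], Eq. (11)
        maps[:, j] = relu(filters @ w_j + biases)  # c_j = f(w_j ∘ m + b), Eq. (12)
    return maps                       # W = [c_1, ..., c_n], Eq. (13)

# toy example: sentence of L=6 words, d=4 dimensions, n=3 filters of width k=2
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W = conv1d_feature_maps(X, rng.normal(size=(3, 2 * 4)), np.zeros(3))
print(W.shape)  # (3, 5): one feature map of length L-k+1 per filter
```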

Then, max-over-time pooling is applied to this feature map to obtain a fixed-length vector for classification. The pooling operation extracts the maximum value from the matrix (feature map). After each convolution, a max-pooling layer is added to extract the most significant element of each feature map; these elements are then combined into a feature vector. It is common to periodically insert a pooling layer between successive convolutional layers in a ConvNet architecture; this progressively reduces the spatial size of the representation, thereby reducing the number of parameters and the amount of computation in the network and helping to control overfitting. Two common pooling operations, max pooling and average pooling, are shown in Figure 2. Max pooling chooses the maximum value within the filter window as the new value in the output matrix, while average pooling takes the average of all values within the window.

**Figure 2.** Pooling operation.
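As a small illustration of the two pooling operations in Figure 2, the sketch below (toy values, pooling over the entire time axis of each feature map) contrasts max pooling with average pooling:

```python
import numpy as np

# two toy feature maps (one per row), each of length 4
feature_map = np.array([[1., 3., 2., 4.],
                        [5., 6., 1., 2.]])

# max pooling: keep the largest value in each feature map
max_pooled = feature_map.max(axis=1)    # -> [4., 6.]

# average pooling: take the mean of the values in each feature map
avg_pooled = feature_map.mean(axis=1)   # -> [2.5, 3.5]

print(max_pooled, avg_pooled)
```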

## *3.4. Proposed BLSTM-C Model*

As shown in Figure 3, our model begins with a BLSTM layer that produces a sequence output based on both the past and the future context. This sequence is then fed to the CNN layer, which extracts features from it. After that, a max-over-time pooling layer produces a fixed-length vector, which is fed to the output layer; the output layer employs a softmax function to classify the input. Blocks of the same color in the feature map layer and the window feature sequence layer correspond to features of the same window.

**Figure 3.** The architecture of the BLSTM-C model.
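A minimal Keras sketch of the BLSTM-C pipeline in Figure 3 is given below. All hyperparameters (`vocab_size`, `embed_dim`, `lstm_units`, `n_filters`, `k`, `num_classes`) are hypothetical placeholders, not the settings used in our experiments:

```python
from tensorflow.keras import Input, layers, models

# illustrative hyperparameters only
vocab_size, embed_dim, seq_len = 10000, 128, 100
lstm_units, n_filters, k, num_classes = 64, 100, 3, 5

model = models.Sequential([
    Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True)),  # BLSTM layer
    layers.Conv1D(n_filters, k, activation="relu"),   # CNN layer, Eqs. (11)-(13)
    layers.GlobalMaxPooling1D(),                      # max-over-time pooling
    layers.Dense(num_classes, activation="softmax"),  # output layer, Eq. (15)
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```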

In probability theory, the output of the softmax function can be used to represent a categorical distribution, that is, a probability distribution over $K$ different possible outcomes. In our experiment, the resulting probability distribution is over the categories of the dataset, and the category with the highest probability is the one to which the input text belongs. The function is defined as follows:

$$\sigma: \mathbb{R}^K \to \left\{ z \in \mathbb{R}^K \;\middle|\; z_l > 0, \sum_{l=1}^K z_l = 1 \right\}, \tag{14}$$

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}. \tag{15}$$
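A short numeric sketch of Equation (15), using hypothetical output-layer scores, shows how the largest score maps to the predicted category:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; Eq. (15) is unchanged by this shift
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical output-layer scores
probs = softmax(logits)
print(probs)          # ~[0.659, 0.242, 0.099], sums to 1
print(probs.argmax()) # 0 -> predicted category
```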

## **4. Experiment**
