#### *2.1. Feature-Enhanced Word Embedding Module*

The first step is to pre-process the input sentence and the sentiment resource words. To map each word in the sentence to a real-valued vector, we apply the pre-trained GloVe [18] embeddings in the word embedding layer. Let $L \in \mathbb{R}^{|V| \times d}$ be the embedding lookup table generated by GloVe, where $d$ is the dimension of the word vectors and $|V|$ is the vocabulary size. Suppose the input sentence consists of $n$ words and the sentiment resource sequence consists of $m$ words. The input sentence retrieves its word vectors from $L$, giving a list of vectors $[w_1, w_2, \cdots, w_n]$, where $w_i \in \mathbb{R}^d$ is the word vector of the $i$-th word. Similarly, the sentiment resource sequence retrieves its word vectors from $L$ and forms a list of vectors $[w^s_1, w^s_2, \cdots, w^s_m]$. In this way, we obtain the matrix $W^c = [w_1, w_2, \cdots, w_n] \in \mathbb{R}^{n \times d}$ for the context words and the matrix $W^s = [w^s_1, w^s_2, \cdots, w^s_m] \in \mathbb{R}^{m \times d}$ for the sentiment resource words.
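To make the notation concrete, the following is a minimal, hypothetical sketch of the embedding lookup: the toy vocabulary, the random stand-in for the GloVe table, and the helper `embed` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy vocabulary and lookup table L (|V| x d); in the paper, L holds
# 200-dimensional pre-trained GloVe vectors rather than random values.
vocab = {"this": 0, "movie": 1, "is": 2, "so": 3, "wasteful": 4, "disgusting": 5}
d = 200
rng = np.random.default_rng(0)
L = rng.uniform(-0.1, 0.1, size=(len(vocab), d))  # stand-in for GloVe embeddings

def embed(words, lookup, vocab):
    """Map a token list to a (len(words) x d) matrix of word vectors."""
    return np.stack([lookup[vocab[w]] for w in words])

context_words = ["this", "movie", "is", "so", "wasteful", "disgusting"]  # n = 6
sentiment_words = ["wasteful", "disgusting"]                             # m = 2

W_c = embed(context_words, L, vocab)    # W^c: n x d context matrix
W_s = embed(sentiment_words, L, vocab)  # W^s: m x d sentiment-resource matrix
```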

After obtaining the general word-vector matrices, we propose a novel sentiment attention mechanism that highlights the sentiment-resource-relevant context words when generating the sentence representation with enhanced sentiment features. Specifically, we take the sentiment words in a sentiment lexicon as sentiment resource words and integrate them with the attention mechanism to emphasize the information most related to the sentiment polarity. The sentiment lexicon combines the resources of Hu and Liu [19] and Qian et al. [20], containing 10,899 sentiment resource words in total. The attention mechanism then uses the sentiment resource words as an attention source attending to the context words to learn the feature-enhanced word embedding. In the following, we describe the sentiment attention mechanism in detail.

First, inspired by the fact that sentiment words largely guide the sentiment polarity of a sentence, we model the word-level relationship between the sentiment words and the context words. For example, the sentence "This movie is so wasteful of talent, it is truly disgusting" is composed of the sentiment words (i.e., "wasteful" and "disgusting") and the context words (i.e., all other words in the sentence), and "wasteful" and "disgusting" play the key role in determining the sentiment polarity of this sentence. Mathematically, we adopt the dot product between the context words and the sentiment words to form a correlation matrix. The specific calculation is as follows:

$$M^{s} = W^{c} \circ \left(W^{s}\right)^{T} \tag{1}$$

where ◦ denotes the dot product operation and $M^s \in \mathbb{R}^{n \times m}$ represents the relevance matrix between the context words and the sentiment words.

Then, we define the context-word-relevant sentiment word representation matrix $X^s$, generated by the dot product between the context words $W^c$ and the correlation matrix $M^s$:

$$X^{s} = \left(W^{c}\right)^{T} \circ M^{s} \tag{2}$$

where $X^s \in \mathbb{R}^{d \times m}$ represents the sentiment word representation matrix related to the context words. Similarly, we compute the sentiment-word-relevant context word representation matrix $X^c$ by the dot product between the sentiment words $W^s$ and the correlation matrix $M^s$:

$$X^{c} = \left(M^{s} \circ W^{s}\right)^{T} \tag{3}$$

where $X^c \in \mathbb{R}^{d \times n}$ represents the context word representation matrix related to the sentiment words. The illustration of the sentiment-context word correlation is shown in Figure 2.

**Figure 2.** Sentiment-context word correlation.
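As a concrete illustration, Equations (1)–(3) reduce to three matrix products. The sketch below continues from the toy $W^c$ ($n \times d$) and $W^s$ ($m \times d$) matrices constructed in the earlier embedding sketch.

```python
# Equations (1)-(3) as plain matrix products over the toy W_c and W_s.
M_s = W_c @ W_s.T      # Eq. (1): n x m correlation between context and sentiment words
X_s = W_c.T @ M_s      # Eq. (2): d x m context-relevant sentiment word representations
X_c = (M_s @ W_s).T    # Eq. (3): d x n sentiment-relevant context word representations
```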

After obtaining the sentiment-context word correlation, we adopt the attention mechanism to highlight the information that contributes to predicting the sentiment polarity of the input sentence. As described in this section, we consider the influence of the sentiment words on the context, which provides additional clues for attending to the relevant sentiment features. Meanwhile, the attention mechanism allows us to handle complex situations in which the sentiment shifts. For example, in the sentence "It is actually my favorite kind of film, but as an adaptation, it fails from every angle", a human reader focuses on the word "favorite" after reading the first clause, because "favorite" largely determines the sentiment polarity of the sentence up to that point. Once the last clause is read, the reader's attention turns to the word "fails", which determines the sentiment polarity of the whole sentence. By using an attention mechanism that simulates this behavior, "fails" in the context of "it fails from every angle" is assigned more "attention" than "favorite" in the context of "my favorite kind of film". Formally, the attention score function β is defined as follows:

$$t_s = \frac{\sum_{i=1}^{m} X_i^s}{m} \tag{4}$$

$$\beta\left(\left[X_i^c; t_s\right]\right) = u_s^{T} \tanh\left(W_s \left[X_i^c; t_s\right]\right) \tag{5}$$

where $t_s$ denotes the overall representation of the sentiment words, and $u_s^T$ and $W_s$ are learnable parameters. With the attention score function β, which measures the importance of the $i$-th context word $X_i^c$, the attention mechanism generates the attention weight $\alpha_i$ by:

$$\alpha_i = \frac{\exp\left(\beta\left(\left[X_i^c; t_s\right]\right)\right)}{\sum_{j=1}^{n} \exp\left(\beta\left(\left[X_j^c; t_s\right]\right)\right)} \tag{6}$$

where $\alpha_i$ represents the importance of the $i$-th word in the sentence. Finally, we apply the attention weight $\alpha_i$ to the context words:

$$x_i = \alpha_i X_i^c \tag{7}$$

where $x_i$ is the feature-enhanced representation of the $i$-th word of the input sentence. The final output of the feature-enhanced word embedding layer is $X = [x_1, x_2, \cdots, x_n]$. Although the attention mechanism can highlight the sentiment features in the sentence, it cannot fully capture the interactive information of the textual structure. To enhance the final representation of the sentence, we pass the sentence representation generated by the feature-enhanced word embedding module to the deep neural network module.
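The attention step of Equations (4)–(7) can be sketched as follows, continuing from the toy $X^s$ and $X^c$ above. The attention parameters `W_att` and `u_att` (the $W_s$ and $u_s$ of Equation (5)) are learned in the real model; here they are randomly initialised purely for illustration, and the size `d_attn` is an assumption.

```python
# Sentiment attention, Equations (4)-(7); d_attn and the random
# parameter initialisation are illustrative assumptions.
d_attn = 100
W_att = rng.uniform(-0.1, 0.1, size=(d_attn, 2 * d))   # W_s in Eq. (5)
u_att = rng.uniform(-0.1, 0.1, size=(d_attn,))         # u_s in Eq. (5)

t_s = X_s.mean(axis=1)                                  # Eq. (4): overall sentiment vector, shape (d,)

scores = np.array([
    u_att @ np.tanh(W_att @ np.concatenate([X_c[:, i], t_s]))
    for i in range(X_c.shape[1])
])                                                      # Eq. (5): one score per context word
alpha = np.exp(scores) / np.exp(scores).sum()           # Eq. (6): softmax attention weights
X = np.stack([alpha[i] * X_c[:, i] for i in range(len(alpha))])  # Eq. (7): n x d output
```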

#### *2.2. Deep Neural Network Module*

The deep neural network module is composed of two parts: a Bi-GRU and a CNN. Text is inherently sequential and structured; to avoid destroying this sequence structure, our model passes the output of the feature-enhanced word embedding layer first to the Bi-GRU layer and then to the CNN layer, obtaining the final representation of the input sentence. In the following, we describe the two parts in the order of the data flow.

As shown in Figure 1, the Bi-GRU layer contains two sub-layers for the forward and backward sequences, respectively, which gives the model access to both future and past context. Since the GRU is a variant of the LSTM, we first briefly review the LSTM for sequence modeling. The main idea of the LSTM is to overcome the vanishing and exploding gradient problems of recurrent neural networks by introducing an adaptive gating mechanism that controls the data flowing into and out of the memory cell. Taking the sequence $X = [x_1, x_2, \cdots, x_n]$ from the output of the feature-enhanced word embedding layer as an example, the LSTM processes the data word by word, time step by time step, from the past to the future. At time step $t$, the current hidden state $h_t$ and the current memory cell state $c_t$ are calculated as follows:

$$i_t = \mathrm{sigmoid}\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{8}$$

$$f_t = \mathrm{sigmoid}\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{9}$$

$$o_t = \mathrm{sigmoid}\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{10}$$

$$\widetilde{c}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{11}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \widetilde{c}_t \tag{12}$$

$$h_t = o_t \odot \tanh(c_t) \tag{13}$$

where · denotes matrix multiplication and ⊙ stands for element-wise multiplication. $i_t$ refers to the input gate, which controls the input of new information to the memory cell; $f_t$ indicates the forget gate, which controls how long certain values are held in the memory cell; and $o_t$ represents the output gate, which controls how much the values stored in the memory cell affect the output activation of the block. $W_i$, $W_f$, and $W_o$ are the weight matrices for these three gates, $b_i$, $b_f$, and $b_o$ are the corresponding bias vectors, and $W_c$ and $b_c$ parameterize the candidate memory cell state $\widetilde{c}_t$.

The GRU can be regarded as a simplification of the LSTM: it merges the input gate $i_t$ and the forget gate $f_t$ of the LSTM into a single update gate $z_t$, introduces a reset gate $r_t$ that controls how much of the previous hidden state enters the candidate state, and updates the hidden state $h_t$ directly without a separate memory cell or output gate. Through this design, the GRU retains the advantages of the LSTM while having a simpler structure, fewer parameters, and better convergence than the LSTM [13]. As shown in Figure 3, at each time step $t$, the GRU transition functions are defined as follows:

$$r_t = \mathrm{sigmoid}\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{14}$$

$$z_t = \mathrm{sigmoid}\left(W_z \cdot [h_{t-1}, x_t]\right) \tag{15}$$

$$\widetilde{h}_t = \tanh\left(W_{\widetilde{h}} \cdot [r_t \odot h_{t-1}, x_t]\right) \tag{16}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \widetilde{h}_t \tag{17}$$

**Figure 3.** The architecture of the gated recurrent unit (GRU) cell used in the SDNN model.
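To make Equations (14)–(17) concrete, here is a minimal numpy sketch of a single GRU step; biases are omitted as in the equations above, and all shapes and weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, W_r, W_z, W_h):
    """One GRU transition following Equations (14)-(17)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)                    # Eq. (14): reset gate
    z_t = sigmoid(W_z @ concat)                    # Eq. (15): update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # Eq. (16): candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # Eq. (17): new hidden state

# Toy usage: hidden size 4, input size 3, weights of shape (hidden, hidden + input).
rng = np.random.default_rng(1)
W_r, W_z, W_h = (rng.standard_normal((4, 7)) for _ in range(3))
h_t = gru_cell(rng.standard_normal(3), np.zeros(4), W_r, W_z, W_h)
```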

In the Bi-GRU layer, the forward GRU processes the sentence word by word in the order of the input sequence to obtain the hidden state $\overrightarrow{h_t}$ at each time step $t$. The backward GRU does the same, except that its input sequence is reversed. The final hidden state $h_t$ at time step $t$ is computed as:

$$\overrightarrow{h_t} = \overrightarrow{GRU}\left(x_t, \overrightarrow{h_{t-1}}\right) \tag{18}$$

$$\overleftarrow{h_t} = \overleftarrow{GRU}\left(x_t, \overleftarrow{h_{t-1}}\right) \tag{19}$$

$$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t} \tag{20}$$

where $\oplus$ denotes the element-wise sum of the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$. The output of the Bi-GRU layer is $H = [h_1, h_2, \cdots, h_n]$, where $n$ is the length of the input sentence.
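A hedged PyTorch sketch of the Bi-GRU layer follows. The hidden size of 200 matches Section 3.2; the batch size, sentence length, and explicit element-wise sum over the two directions (Equation (20)) are illustrative, since `nn.GRU` returns the forward and backward states concatenated.

```python
import torch
import torch.nn as nn

d, hidden = 200, 200                       # embedding and GRU hidden sizes (Section 3.2)
bigru = nn.GRU(input_size=d, hidden_size=hidden, bidirectional=True, batch_first=True)

x = torch.randn(1, 30, d)                  # toy batch: 1 sentence of n = 30 words
out, _ = bigru(x)                          # (1, 30, 2 * hidden): forward || backward
fwd, bwd = out[..., :hidden], out[..., hidden:]
H = fwd + bwd                              # Eq. (20): element-wise sum, (1, 30, hidden)
```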

Although the sentence representation obtained from the Bi-GRU layer preserves the sequence information of the sentence, it is not flexible enough on its own to predict the sentiment polarity of the input sentence. To alleviate this problem, the SDNN model feeds the output of the Bi-GRU layer into the CNN layer, because the CNN is well suited to recognizing local features within a multi-dimensional field. Specifically, the CNN layer consists of a one-dimensional convolutional layer and a max-pooling layer. We treat $H = [h_1, h_2, \cdots, h_n] \in \mathbb{R}^{n \times d}$, where $h_i$ represents the $i$-th word in the sentence with $d$ dimensions, as an "image", as in Figure 1, so the one-dimensional convolutional layer can slide along the word dimension and convolve the matrix $H$ with multiple kernels of different widths. The convolutional operation is defined as follows:

$$c_i = f\left(H_{i:i+K-1} \circ W_c + b_c\right) \tag{21}$$

where ◦ denotes the dot product operation, $K$ represents the width of the convolutional kernel, $f$ is a non-linear function such as ReLU, $W_c$ is the convolutional kernel matrix, and $b_c$ is the bias term. Each kernel corresponds to a linguistic feature detector that extracts a specific n-gram pattern at various granularities [10]. The convolutional kernel is applied to every possible region of the matrix $H$ to produce a feature map $C = [c_1, c_2, \cdots, c_{n_K}]$ for each kernel width, where $n_K$ is the number of convolutional kernels. Then, for each $c_i$ in $C$, the max-pooling layer extracts the maximum value from the generated feature map:

$$p_i = \mathrm{down}\left(c_i\right) \tag{22}$$

where $\mathrm{down}(\cdot)$ represents the max-pooling function. In this way, the pooling layer extracts the local dependencies within different regions and keeps the most salient information, resulting in a fixed-size vector whose size equals $n_K$. Finally, the outputs of the max-pooling layer for the different kernel widths are concatenated to form the final sentence representation $S^{*} = \left[p_1, p_2, \cdots, p_{n_K \times n_{wid}}\right]$, where $n_{wid}$ denotes the number of different kernel widths.
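Continuing the PyTorch sketch, the convolution and pooling of Equations (21)–(22) can be written as below. The kernel widths 3, 4, and 5 with 100 feature maps each follow Section 3.2, while global max pooling over word positions is an illustrative reading of Equation (22).

```python
# CNN layer over the Bi-GRU output H of shape (batch, n, hidden).
convs = nn.ModuleList([
    nn.Conv1d(in_channels=hidden, out_channels=100, kernel_size=k)
    for k in (3, 4, 5)
])

H_t = H.transpose(1, 2)                    # (batch, hidden, n), as Conv1d expects
pooled = []
for conv in convs:
    c = torch.relu(conv(H_t))              # Eq. (21): feature maps for one kernel width
    p = torch.max(c, dim=2).values         # Eq. (22): max pooling over word positions
    pooled.append(p)
S_star = torch.cat(pooled, dim=1)          # final sentence representation, (batch, 300)
```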

In the deep neural network module, the sequence information generated by the Bi-GRU and the local features captured by the CNN are thus integrated, which helps the sentence classifier predict the sentiment polarity.

#### *2.3. Sentence Classifier Module*

For text sentiment classification, the final sentence representation $S^{*}$ of the input text $S$ is fed into a softmax layer to predict the probability distribution over the $C$ sentiment category labels, and the label with the highest probability is selected as the final sentiment category of the sentence. The function is as follows:

$$\widetilde{y}_i = \frac{\exp\left(W_o^{T} S^{*} + b_o\right)_i}{\sum_{j=1}^{C} \exp\left(W_o^{T} S^{*} + b_o\right)_j} \tag{23}$$

$$y_{pre} = \arg\max_{i} \ \widetilde{y}_i \tag{24}$$

where $\widetilde{y}$ is the predicted sentiment distribution of the sentence, $y_{pre}$ is the selected sentiment label, and $W_o$ and $b_o$ are the parameters to be learned.
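A minimal sketch of the classifier of Equations (23)–(24), continuing from `S_star` above; the number of classes $C = 2$ is just the binary MR setting and is chosen here for illustration.

```python
# Softmax classifier over the sentence representation S_star of size 300.
C = 2                                      # e.g., positive / negative for MR
classifier = nn.Linear(300, C)             # holds W_o and b_o

logits = classifier(S_star)                # W_o^T S* + b_o
y_tilde = torch.softmax(logits, dim=1)     # Eq. (23): predicted distribution
y_pre = torch.argmax(y_tilde, dim=1)       # Eq. (24): predicted sentiment label
```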

To train the SDNN model, we adopt the categorical cross-entropy loss as the training objective, which is minimized during the training process:

$$J(\Theta) = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_i^j \log \widetilde{y}_i^j + \lambda_r \sum_{\theta \in \Theta} \theta^2 \tag{25}$$

where $N$ is the number of training samples, $y$ is the one-hot distribution of the ground truth, $\lambda_r$ is the coefficient of the $L_2$ regularization, and $\Theta$ is the set of all parameters optimized during training. All the parameters are updated by the stochastic gradient descent strategy, defined as:

$$\Theta = \Theta - \lambda_l \frac{\partial J(\Theta)}{\partial \Theta} \tag{26}$$

where $\lambda_l$ is the learning rate. The specific hyper-parameter settings are described in Section 3.2.
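The objective of Equations (25)–(26) corresponds to the standard cross-entropy-plus-$L_2$ recipe sketched below; `weight_decay` supplies the $\lambda_r \sum \theta^2$ term, and plain SGD stands in for the update rule of Equation (26), even though the experiments in Section 3.2 use AdaDelta.

```python
# Cross-entropy loss with L2 regularisation, minimised by gradient descent
# (continuing from the bigru / convs / classifier modules sketched above).
lambda_r, lambda_l = 1e-5, 1e-2            # L2 coefficient and learning rate (Section 3.2)
params = list(bigru.parameters()) + list(convs.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=lambda_l, weight_decay=lambda_r)
loss_fn = nn.CrossEntropyLoss()

target = torch.tensor([1])                 # toy ground-truth label for the batch
loss = loss_fn(logits, target)             # Eq. (25), with L2 handled by weight_decay
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # Eq. (26): parameter update
```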

#### **3. Experiments**

#### *3.1. Datasets*

We conduct experiments on two publicly available datasets. The movie review (MR) dataset, collected by Pang and Lee [21], contains one sentence per review and targets the detection of positive or negative reviews; it has 5331 negative samples and 5331 positive samples. The Stanford Sentiment Treebank (SST) dataset is an extension of MR by Socher et al. [22], which is manually split into train, development, and test sets and contains fine-grained sentiment labels (very positive, positive, neutral, negative, very negative). Similar to the MR dataset, the number of instances of each class in SST is approximately equal, which helps avoid the loss of generalization ability caused by an uneven sample distribution during training. Since the MR dataset lacks a development set, we randomly sampled 10% of the training data as the development set. The detailed dataset statistics are shown in Table 1.

**Table 1.** The summary statistics of the two datasets. c: number of target classes, l: average sentence length, m: maximum sentence length, train/dev/test: train/development/test set size, |V|: vocabulary size, |Vpre|: number of words present in the set of pre-trained word embeddings, CV: 10-fold cross validation.


#### *3.2. Implementation Details*

In order to improve the quality of the data, we pre-processed the text by removing stopwords (e.g., "in", "of", "from") and punctuation. All word embeddings were then initialized with the 200-dimensional GloVe vectors pre-trained by Pennington et al. [18]. For out-of-vocabulary words, we randomly sampled their embeddings from a uniform distribution $U(-0.1, 0.1)$. Some works fine-tune the word vectors during training to improve performance on text sentiment classification tasks [23]; in contrast, to better reflect the generalization ability of the model, we used the same general embeddings for all datasets. Moreover, when a sentence contained no sentiment resource word, we treated all of its context words as sentiment resource words, so that the sentiment attention degenerates into a self-attention mechanism.
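A hypothetical preprocessing sketch consistent with this description is given below; the stopword list, the tokenization regex, and the `glove` dictionary are stand-ins, not the actual resources used in the experiments.

```python
import re
import numpy as np

STOPWORDS = {"in", "of", "from", "the", "a", "an"}   # tiny illustrative list

def preprocess(sentence):
    """Lower-case, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def lookup_or_init(word, glove, rng, d=200):
    """Return the pre-trained vector for `word`, or a U(-0.1, 0.1) sample if out of vocabulary."""
    return glove.get(word, rng.uniform(-0.1, 0.1, size=d))
```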

For the deep neural network, the hidden state size of the GRU in each layer was set to 200. In the convolutional layer, we employed 1D convolutional filter windows of widths 3, 4, and 5 with 100 feature maps each, and a 1D pooling size of 4.

During training, we optimized the proposed model with the AdaDelta algorithm [24], using a learning rate of $10^{-2}$ and a mini-batch size of 32. To alleviate overfitting, we employed the dropout strategy [25] with a dropout rate of 0.5 for the Bi-GRU layer and 0.2 for the penultimate layer, and set the coefficient $\lambda_r$ of the $L_2$ regularization to $10^{-5}$. To evaluate the performance on the sentiment classification task, we used accuracy and F1 as the metrics.
