Topic-Based Document-Level Sentiment Analysis Using Contextual Cues

Truică, Ciprian-Octavian; Apostol, Elena-Simona; Șerban, Maria-Luiza; Paschke, Adrian

doi:10.3390/math9212722


by Ciprian-Octavian Truică ^1,*,†, Elena-Simona Apostol ^1,*,†, Maria-Luiza Șerban ^1,† and Adrian Paschke ^2

^1 Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, RO-060042 Bucharest, Romania

^2 Fraunhofer Institute for Open Communication Systems, 10589 Berlin, Germany

^* Authors to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Mathematics 2021, 9(21), 2722; https://doi.org/10.3390/math9212722

Submission received: 28 September 2021 / Revised: 18 October 2021 / Accepted: 24 October 2021 / Published: 27 October 2021

(This article belongs to the Special Issue Advanced Aspects of Computational Intelligence with Its Applications)


Abstract

Document-level Sentiment Analysis is a complex task that implies the analysis of large textual content that can incorporate multiple contradictory polarities at the phrase and word levels. Most of the current approaches either represent textual data using pre-trained word embeddings without considering the local context that can be extracted from the dataset, or they detect the overall topic polarity without considering both the local and global context. In this paper, we propose a novel document-topic embedding model, DocTopic2Vec, for document-level polarity detection in large texts by employing general and specific contextual cues obtained through the use of document embeddings (Doc2Vec) and Topic Modeling. In our approach, (1) we use a large dataset with game reviews to create different word embeddings by applying Word2Vec, FastText, and GloVe, (2) we create Doc2Vecs enriched with the local context given by the word embeddings for each review, (3) we construct topic embeddings Topic2Vec using three Topic Modeling algorithms, i.e., LDA, NMF, and LSI, to enhance the global context of the Sentiment Analysis task, (4) for each document and its dominant topic, we build the new DocTopic2Vec by concatenating the Doc2Vec with the Topic2Vec created with the same word embedding. We also design six new Convolutional-based (Bidirectional) Recurrent Deep Neural Network Architectures that show promising results for this task. The proposed DocTopic2Vecs are used to benchmark multiple Machine and Deep Learning models, i.e., a Logistic Regression model, used as a baseline, and 18 Deep Neural Networks Architectures. The experimental results show that the new embedding and the new Deep Neural Network Architectures achieve better results than the baseline, i.e., Logistic Regression and Doc2Vec.

Keywords: document-level Sentiment Analysis; document-topic embeddings; Topic Modeling; Deep Learning Architectures

1. Introduction

Opinion Mining and Sentiment Analysis are related research topics, at the intersection of Machine Learning and Natural Language Processing, that have recently been studied intensively [1,2,3,4,5,6]. The interest in these related topics is due to the wide range of applications where they can be used (e.g., advertising, politics, business, etc.) and the availability of large amounts of textual data. They are generally used to identify opinions and recognize the sentiments expressed, as well as the general polarity of a text, e.g., subjective or objective, positive or negative. The data sources that are mostly used in Opinion and Sentiment Analysis tasks are blogs, posts from social media, comments from movie and product review sites, and news articles [7]. These can be used to complete different tasks, such as emotion detection and sentiment classification.

Various types of neural networks, e.g., Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), have been employed to solve specific Opinion and Sentiment Analysis tasks more accurately. RNNs have been proven to offer good results for text analysis tasks [8]. Different types of RNNs, such as GRU (Gated Recurrent Unit) or LSTM (Long Short-Term Memory), were developed to overcome the flaws of feed-forward Perceptron-based neural networks. RNNs can capture information about the input, such as context dependency between words, and share parameters across time steps. CNNs [9,10] are a type of feed-forward network that is very popular due to its minimal preprocessing requirements. These types of networks are regarded as more powerful than RNNs. Although CNNs are ideal for image processing and their accuracy is dependent on the initial parameter tuning, they have also been shown to increase performance in text processing, especially when combined with other neural networks.

The main motivation of this paper is to improve the accuracy of document-level Sentiment Analysis (Definition 1) using Deep Learning models which employ contextual cues (Definition 2). We aim to introduce specific/local (Definition 3) and general/global (Definition 4) contextual cues by employing word embeddings (WordEmbs) and Topic Modeling in order to improve the accuracy of polarity detection. To this end, we enhance the document embedding (Doc2Vec) using two sources of context: different WordEmbs, which add local context by training the embeddings on the documents within a set of documents D, and Topic Modeling algorithms, which add global context by extracting topics from the set of documents D. We combine Doc2Vecs and topic embeddings (Topic2Vecs) into a new embedding, i.e., DocTopic2Vec, built as a concatenation of a Doc2Vec and a Topic2Vec.

Definition 1 (Document-level Sentiment Analysis). Document-level Sentiment Analysis is the task used to determine for a document $d_{i}$ belonging to a set of documents D whether its text has a positive, neutral, or negative polarity.

Definition 2 (Contextual cues). The contextual cues consist of the local and global lexical, semantic, and syntactic information of a word $w_{i}$ given to a Machine/Deep Learning model to solve a task t. (Note: We use both local and global context for document-level Sentiment Analysis.)

Definition 3 (Local Context). The local context refers to the local lexical, semantic, and syntactic information of a word $w_{i}$ within a document d. (Note: We extract the local context by training word embeddings for each word $w_{i} \in d$. Thus, the embedding encodes the local context by preserving the word’s lexical, semantic, and syntactic similarity as well as its relation with other words within the same document.)

Definition 4 (Global Context). The global context refers to the global lexical, semantic, and syntactic information of a word $w_{i}$ within a set of documents D. (Note: We extract the global context by detecting the most important topic for each document $d_{i} \in D$. Thus, documents belonging to the same topic also belong to the same context, and the context given by a topic is seen as a global context for the documents belonging to this topic.)

A Doc2Vec is constructed as the average of the WordEmbs for the terms in the document. This embedding manages to preserve the contexts and semantics of words at the document level [11]. WordEmbs add semantic context by encoding the position of words in a sentence before vectorizing the text. We use five WordEmbs: (1) the Word2Vec CBOW (Continuous Bag-of-Words) model; (2) the Word2Vec Skip-Gram model; (3) the FastText CBOW model; (4) the FastText Skip-Gram model; and (5) the GloVe model. Word2Vec captures the context of a word in a document and the relationship with the words surrounding it. Furthermore, this embedding manages to encode the semantic and syntactic similarity of the words within the document. Word2Vec uses two models to determine the local context: CBOW and Skip-Gram. The CBOW model predicts a word from the context formed by its surrounding words. The Skip-Gram model takes a word and determines the words that are in the same context. FastText extends Word2Vec by learning embedding vectors for the n-grams that are found within each word. FastText also uses the CBOW and Skip-Gram models. GloVe enhances the local context information of words using global statistics, i.e., word co-occurrence.

We use Topic Modeling to extract the hidden latent semantic patterns and to add a general context to Sentiment Analysis by detecting and grouping documents with similar characteristics by their subjects of interest. We employ different Topic Modeling algorithms, i.e., Latent Dirichlet allocation (LDA) [12], Non-Negative Matrix Factorization (NMF) [13], and Latent Semantic Indexing (LSI) [14]. We encode these hidden patterns that add a general context to Sentiment Analysis into Topic2Vecs. Topic2Vecs are built as the average of the WordEmbs of each topic's top-k terms, weighted by the terms' relevance. By employing Topic2Vecs, we manage to encode context-based document grouping and to enhance each document’s context by constructing the DocTopic2Vec as a concatenation between the document’s Doc2Vec and the Topic2Vec of its dominant topic. Thus, documents that are similar in meaning and context, including polarity and opinion, will be closer to each other in the vector space than texts which are not necessarily related.

For the experiments, we use a large dataset consisting of game reviews. We create the DocTopic2Vecs using the discussed WordEmbs and Topic Modeling algorithms. Each DocTopic2Vec embedding is used in classification tasks that apply Logistic Regression (LogReg) and neural networks with LSTM, GRU, Bidirectional, Dense, and CNN layers. We also design six new Convolutional-based (Bidirectional) Recurrent Deep Neural Network (CNN-(Bi)RNN) Architectures for the task of determining accurate document-level polarity. The results of our benchmark show that the accuracy is improved by about 5% when adding contextual cues to Doc2Vec with the NMF and LSI Topic Modeling algorithms, compared to the baseline, i.e., Doc2Vec-based LogReg Sentiment Analysis. Furthermore, the proposed new architectures outperformed the state-of-the-art solution proposed in [3].

The main research questions we want to answer are:

( $Q_{1}$ ): Does a Topic Modeling approach improve the overall accuracy of detecting the polarity of textual data?
( $Q_{2}$ ): Can local context added by Word Embeddings and global context added by Topic Modeling improve the accuracy of the Sentiment Analysis task?
( $Q_{3}$ ): Can a novel CNN-(Bi)RNN architecture prove to be a better model for the Sentiment Analysis task?

Thus, by answering these questions, the main objective of this work is three-fold:

( $O_{1}$ ): Analyze the impact of Topic Modeling on the Sentiment Analysis task;
( $O_{2}$ ): Construct a novel embedding DocTopic2Vec that encapsulates both local and global context in order to improve the accuracy of detecting the polarity of textual data;
( $O_{3}$ ): Build a novel CNN-(Bi)RNN architecture to increase the accuracy of the Sentiment Analysis task.

This paper is structured as follows. In Section 2, we discuss the current advancement in Sentiment Analysis techniques. Section 3 presents the proposed architecture and describes each component module, together with the algorithms and techniques used. In Section 4, we describe the dataset and our set of experiments, and we analyze and interpret the results. Section 5 draws the final conclusions and provides several future directions.

2. Related Work

Sentiment Analysis approaches can be classified into three categories: Machine Learning, Lexicon-based, and Hybrid [15]. Furthermore, these techniques are divided, based on the granularity level, into word (or aspect), sentence (or short text), and document (or long text) levels.

There are not many solutions focusing on context-based Sentiment Analysis models. A context enrichment model for Sentiment Analysis is proposed in [4]. The authors add several processing steps, prior to sentiment classification, in order to augment the dataset with context. One important step discussed here is the prior-polarity identification with SentiWordNet. Unfortunately, the authors do not clearly specify what are the advantages of prior-polarity identification, and their model is just conceptual without any real experiments.

Most of the related previous works primarily use either only embeddings as text representation that are incorporated into the Sentiment Analysis model (e.g., [2,3]) or they consider Topic Modeling for determining the opinion by topic, and not to add context to the model (e.g., [16,17]).

In [3], the authors propose a Deep Learning 4CNN-BiLSTM model for document-level Sentiment Analysis. Their model consists of four CNN layers and one BiLSTM layer. For the experiments, they use a relatively small amount of documents, i.e., 2003 articles from French newspapers. They employ two optimizers, SGD and Adam, and Word2Vec as WordEmbs solution. The proposed model is compared with CNN, LSTM, BiLSTM, and CNN-LSTM, and they conclude that it achieves the best accuracy. Although they obtained a high accuracy for the 4CNN-BiLSTM model, the results are not conclusive, as the experiments are performed on a small dataset. In our experiments, we also analyze their model, both the version proposed by them and also by adding Topic Modeling.

Attention mechanisms condition the Sentiment Analysis model to pay attention to the features which contribute the most to the task. The authors of [18] propose a model based on LSTM layers with an attention mechanism. They vary both the attention mechanism, i.e., convolution-based and pooling-based attention, and the word vectors used for training, i.e., pre-trained word vectors from Word2Vec and randomly initialized word vectors. Their model obtained better results than baseline methods on two out of three datasets. The Attention-based Bidirectional CNN-RNN Deep Model (ABCDM) [19], another attention-based solution, uses independent BiLSTM and GRU layers to extract both past and future contexts and an attention mechanism to put more or less emphasis on different words. To reduce the dimensionality and create new feature representations, the ABCDM model utilizes both convolutional layers and pooling techniques. This model achieves state-of-the-art performance when compared with other Neural Network architectures for the task of Sentiment Analysis on reviews and Twitter datasets.

An improved method for generating WordEmbs used in Sentiment Analysis is proposed in [2]. This method, Improved Word Vectors, uses Part-of-Speech, lexicon-based, and word position techniques together with Word2Vec or GloVe models. The performance of the proposed solution is tested using four different Deep Learning models and benchmark sentiment datasets. The results show that when using these embeddings, the accuracy of the model is slightly increased.

One solution that uses Topic Modeling for sentiment detection is presented in [17]. The authors combine shrinkage regression and Topic Modeling for detecting polarity in a Twitter dataset. The proposed model consists of two stages. In the first stage, they detect the polarity of the tweets using two shrinkage regression models. This type of regression adds a penalty in the way the loss function is calculated for models that have too many variables. During the second stage, the relevant topics are identified using LDA. The model estimates the sentiment of each topic using term sentiment scores.

Topic Modeling and WordEmbs have been used together to analyze the sentiment of topics. However, they have never been applied in Sentiment Analysis at the document level, as we propose in this paper. This approach is used for aspect-based topic Sentiment Analysis [20,21]. In this case, Topic Modeling is used for aspect extraction and categorization without considering the global context. In [21], the authors combine domain-trained WordEmbs and Topic Modeling for categorizing aspect-terms from online reviews. Their proposed model uses continuous WordEmbs and the LDA algorithm. The model is tested using a small dataset, i.e., the restaurant reviews from the SemEval-2014 dataset consisting of 3841 sentences. One important limitation of their model is that it has a longer convergence time than the standard model and has lower performance than supervised models.

Several recent works also explore pre-trained language models for the Sentiment Analysis task, e.g., BERT [22], RoBERTa [23], ALBERT [24]. In [25], BERT is compared with an LSTM-based architecture and achieves an overall better f-measure. In [26], a RoBERTa Sentiment Analysis model is combined with key entity detection, based on the presumption that people are more prone to observe negative information. This approach improves the accuracy of the Sentiment Analysis task when compared with architectures consisting of BERT or RoBERTa transformers combined with SVM, LR, or NBM.

$DICE_{T}$ [1] is another transformer-based method for sentiment analysis. The novelty of $DICE_{T}$ is that it enhances the data quality by handling noises within contexts. For this, it uses six types of embeddings, i.e., character embeddings, GloVe, Part-of-Speech embeddings, Lexicon embeddings, ELMo [27] and BERT-based embeddings. The concatenated embeddings are fed to a BiLSTM network with attention. $DICE_{T}$ has higher performance compared with Sentiment Analysis methods that use the standard one-type of embeddings, e.g., GloVe or Word2Vec, or other pre-processing methods, e.g., TFIDF.

3. Methodology

Figure 1 presents the proposed architecture for our topic-based Sentiment Analysis using a contextual cues model.

The Data Preprocessing module cleans and transforms the textual data to make them suitable for analysis. The Word Embedding and TFIDF Vectorization modules encode the documents’ words into vector representations. The Document Embedding module computes a vector for each document based on the Word Embedding. The Topic Modeling module uses the TFIDF document vectorization to extract the topics and the most relevant keywords. The Topic Embedding module constructs the vector representation of topics using word embeddings. The Document-Topic Embedding module computes the new context-enhanced document embeddings using the topic and document embeddings, which add both semantic and syntactic context to the vector representation. The Classification module uses the new document-topic embeddings to classify documents and extract their polarity. The Evaluation module uses different metrics to determine the accuracy of the classification and the quality of the resulting models.

3.1. Data Preprocessing Module

The preprocessing step is important because text written by people can contain misspelled words, symbols, abbreviations, etc., that need to be removed or replaced so that the subsequent tasks can be executed with greater accuracy [28]. The initial text is preprocessed using the following steps:

(1): The text is cleaned by removing all JavaScript functions, HTML tags, and URLs;
(2): The contractions are expanded;
(3): The named entities are extracted while the rest of the text is lemmatized;
(4): The punctuation and stop words excluding negations (i.e., no, not, etc.) are removed;
(5): The text is transformed to lowercase and then split into tokens;
(6): The tokens that have a length greater than 3 or are negations are kept.

Using this aggressive text preprocessing improves the algorithms’ time performance, as the vocabulary is minimized to the essential tokens without excluding the terms which impact the polarity.
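The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the stop-word, negation, and contraction lists are small hypothetical samples, and step (3) (named entity extraction and lemmatization) is omitted because it would require an NLP library that the text does not name.

```python
import re

# Assumed, illustrative lists; the paper does not enumerate them.
NEGATIONS = {"no", "not", "nor", "never"}
STOP_WORDS = {"the", "a", "an", "is", "was", "this", "and", "it", "to", "of", "do"}
CONTRACTIONS = {"don't": "do not", "isn't": "is not", "can't": "can not", "it's": "it is"}


def preprocess(text):
    """Simplified version of steps (1), (2), (4), (5), and (6)."""
    # (1) Remove script blocks, HTML tags, and URLs.
    text = re.sub(r"<script.*?</script>", " ", text, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    # (5) Lowercase early so the contraction lookup is case-insensitive.
    text = text.lower()
    # (2) Expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # (4) Remove punctuation, then split into tokens.
    tokens = re.sub(r"[^a-z\s]", " ", text).split()
    # (4) + (6) Drop stop words; keep negations and tokens longer than 3 characters.
    return [t for t in tokens
            if t in NEGATIONS or (t not in STOP_WORDS and len(t) > 3)]
```

For example, `preprocess("<p>This game isn't good! See https://example.com</p>")` keeps only `game`, `not`, and `good`: the negation survives the length filter, which is what preserves polarity.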

3.2. Word Embedding Module

The word embedding models used in this paper are Word2Vec, FastText, and GloVe. Each embedding model (WordEmb) generates word representations in a vector space. The context of each word within a document is captured when employing these embeddings. Moreover, these models also encode both the relationship and the similarity between words from a semantic and syntactic perspective.

3.2.1. Word2Vec

Word2Vec represents a textual dataset as a set of vectors and outputs a vector space [29]. The context similarity of a word within the dataset is determined by measuring the distance between the corresponding vectors in this space. Word2Vec uses either the Continuous Bag-Of-Words (CBOW) or the Skip-Gram model to create the representation of words.

The CBOW model utilizes the context of a word as input and attempts to predict the word itself. The input layer of the model is represented by the one-hot encoded vectors corresponding to each context word. The average of the vectors from this layer is used to compute the input for the hidden layer. The hidden layer computes the weighted sum of its inputs and sends it to the next layer. Each term’s probability value is computed by the network’s last layer and is given as the final result in the form of a vector.

The Skip-Gram model, as opposed to CBOW, starts with the word as input and tries to generate its context. The input layer is the target word vector, while the output layer consists of the vectors with the probability values of the words appearing in the context of the target word. The hidden layer sends the weighted input to the following layer. The Skip-Gram model is generally used to discover the semantic similarity between words. Therefore, if two words have a similar context, these words might also have a similar meaning.
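The difference between the two models is easiest to see in the training examples each one derives from the same token sequence. The sketch below only builds those (input, label) pairs; it does not train a network, and the window size and function names are illustrative assumptions.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the target word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-Gram: the target word is the input, each context word is a label.
    return [(target, c) for target, context in context_windows(tokens, window)
            for c in context]
```

With `tokens = ["great", "game", "not", "boring"]` and `window=1`, CBOW produces one example per position (e.g., `(["great", "not"], "game")`), whereas Skip-Gram produces one example per (word, neighbor) pair (e.g., `("game", "great")` and `("game", "not")`).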

3.2.2. FastText

FastText is an unsupervised algorithm that uses the CBOW and Skip-Gram models for learning word embeddings [30]. This embedding is considered an extension of Word2Vec as it follows a similar approach [31]. The difference is that the word is not considered the basic unit, but a bag of character n-grams. This facilitates better accuracy and a faster training time compared to Word2Vec.
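To make the "bag of character n-grams" concrete, the following sketch extracts the n-grams of a single word. The boundary markers `<` and `>` and the default range of 3 to 6 characters follow the common FastText convention and are assumptions here, since the text above does not state them.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams
```

Calling `char_ngrams("game", 3, 3)` yields `<ga`, `gam`, `ame`, and `me>`. Because words that share substrings share n-grams, a vector can be composed even for a word that never appeared in the training data.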

3.2.3. GloVe

GloVe (Global Vectors) is an unsupervised model applied for learning word embeddings [32]. In comparison to the other models described, i.e., Word2Vec and FastText, GloVe considers both local and global statistics of word–word co-occurrences in the corpus to obtain the vector representations of the words. It uses a term co-occurrence matrix that stores, for each word, the frequency of its appearance in the same context with another word. GloVe captures the relationship between words by using the ratio of co-occurrence probabilities. Using this ratio, it extracts information from all the word vectors and identifies word analogies or synonyms within the same contexts.

3.3. Document Embedding Module

The document embeddings Doc2Vec (Equation (1)) are generated for each document $d_{i}$ in the dataset by adding the word embeddings for all the terms t (WordEmb(t)) in the document and dividing the sum by the number of terms in the document ($m_{i}$). We build a Doc2Vec for each WordEmb we previously discussed.

$$Doc2Vec(d_{i}) = \frac{\sum_{t \in d_{i}} WordEmb(t)}{m_{i}} \tag{1}$$
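Equation (1) amounts to a mean over word vectors. A minimal sketch, assuming the word embeddings are available as a plain dictionary from term to NumPy vector (the name `word_emb` is hypothetical), and assuming that terms missing from the dictionary are skipped, which the text does not specify:

```python
import numpy as np


def doc2vec(tokens, word_emb):
    """Average the embeddings of the tokens of one document (Equation (1))."""
    # Assumption: out-of-vocabulary tokens are ignored, so m_i counts known terms only.
    vectors = [word_emb[t] for t in tokens if t in word_emb]
    if not vectors:
        # Assumption: a document with no known term maps to the zero vector.
        dim = len(next(iter(word_emb.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```

Repeated terms contribute once per occurrence, so a term that appears twice pulls the document vector twice as strongly toward its embedding.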

3.4. TFIDF Vectorization

The TFIDF (term frequency-inverse document frequency) Vectorization module uses a bag-of-words approach to vectorize the documents, given:

(1): A textual corpus $D = \{d_{i} \mid i = \overline{1, n}\}$ of size $n = \| D \|$ that contains documents $d_{i}$;
(2): A vocabulary $V = \{t_{1}, \dots, t_{m}\}$ of size $m = \| V \|$ that contains the unique words or terms $t_{j}$ in the dataset D.

A document $d_{i} \in D$ ($i = \overline{1, n}$) of length $m_{i}$ ($\neq m$) is a multi-set of V, i.e., $d_{i} = (V, f) = \{t_{1}^{f(t_{1}, d_{i})}, \dots, t_{m}^{f(t_{m}, d_{i})}\}$, with $f(t_{j}, d_{i}) \geq 0$ the multiplicity (co-occurrences) function, which denotes the number of times $t_{j}$ appears in document $d_{i}$. For simplicity, we will denote $t_{j}^{f(t_{j}, d_{i})}$ as $t_{ij}$; thus, $d_{i} = \{t_{i1}, \dots, t_{im}\}$.

TFIDF (Equation (2)) is defined using:

(1): Term frequency $TF(t_{j}, d_{i})$ (Equation (3)) that computes the co-occurrences $f(t_{j}, d_{i})$ of a term $t_{j} \in V$ in a document $d_{i}$;
(2): The inverse document frequency $IDF(t_{j}, D)$ (Equation (4)), which uses the number of documents $n_{j}$ in which a term $t_{j} \in V$ appears to penalize frequent terms that bring no information gain;
(3): The normalization factor $ℓ^{2}(d_{i})$ (Equation (5)) to normalize TFIDF in the range $[0, 1]$.

$$TFIDF(t_{j}, d_{i}, D) = \frac{TF(t_{j}, d_{i}) \cdot IDF(t_{j}, D)}{ℓ^{2}(d_{i})} \tag{2}$$

$$TF(t_{j}, d_{i}) = f(t_{j}, d_{i}) \tag{3}$$

$$IDF(t_{j}, D) = \log_{2} \frac{n}{n_{j}} \tag{4}$$

$$ℓ^{2}(d_{i}) = \sqrt{\sum_{j=1}^{m} (TF(t_{j}, d_{i}) \cdot IDF(t_{j}, D))^{2}} \tag{5}$$

Using the term weights, we can construct a document–term matrix $A = \{w_{ij} \mid i = \overline{1, n} \land j = \overline{1, m}\}$, where rows correspond to documents and columns to terms. The cell value $w_{ij}$ is the weight (e.g., $TF$, $TFIDF$, etc.) of term $t_{j}$ in document $d_{i}$.
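The weighting scheme of Equations (2)–(5) and the resulting document–term matrix can be sketched with NumPy as follows. The function name and the handling of edge cases (terms that appear in no document, documents whose weights are all zero) are assumptions made for illustration; the paper does not discuss them.

```python
import numpy as np


def tfidf_matrix(docs, vocab):
    """Build the n x m document-term matrix A with l2-normalized TFIDF weights."""
    n, m = len(docs), len(vocab)
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((n, m))
    for i, doc in enumerate(docs):
        for t in doc:
            if t in index:
                tf[i, index[t]] += 1            # Equation (3): raw count f(t_j, d_i)
    n_j = np.count_nonzero(tf, axis=0)          # number of documents containing each term
    idf = np.log2(n / np.maximum(n_j, 1))       # Equation (4); guard against n_j = 0
    weights = tf * idf
    norms = np.sqrt((weights ** 2).sum(axis=1, keepdims=True))  # Equation (5)
    norms[norms == 0] = 1.0                     # assumption: all-zero rows stay zero
    return weights / norms                      # Equation (2)
```

For the toy corpus `[["good", "game"], ["bad", "game"]]` with vocabulary `["good", "bad", "game"]`, the term `game` occurs in every document, so its IDF is $\log_{2}(2/2) = 0$ and its column is zero; only the discriminative terms `good` and `bad` receive weight.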

3.5. Topic Modeling Module

This module utilizes statistical unsupervised Machine Learning methods, i.e., Topic Modeling, to extract hidden latent semantic patterns within our dataset. We use three models for this module, i.e., Latent Dirichlet allocation (LDA) [12], Non-Negative Matrix Factorization (NMF) [13], and Latent Semantic Indexing (LSI) [14], also known as Latent Semantic Analysis (LSA). The Topic Modeling algorithms use the document–term matrix A as input.

3.5.1. Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a probabilistic model that groups terms with similar meaning that represent the same notions [12]. It is one of the most popular Topic Modeling approaches [33]. The LDA algorithm relies on the assumption that random mixtures over latent topics can be used to generate documents. In this context, each topic is described by a multinomial distribution over the unique terms in the vocabulary. Thus, we can generate documents using techniques such as Gibbs sampling, which samples <topic, words> pairs from a random mixture.

For k topics and a corpus of n documents $D = \{d_{i} \mid i = \overline{1, n}\}$, where each document $d_{i}$ is a sequence of $m_{i}$ words $t_{j} \in V$ ($j = \overline{1, m}$) whose length is modeled as a Poisson distribution, i.e., $m_{i} \sim Poisson(ξ)$, LDA uses the following process:

(1): Determine a distribution of topics $θ_{i}$ for each document $d_{i}$;
(2): Determine a distribution of words $φ_{κ}$ in a topic $κ \in \overline{1, k}$;
(3): For each word $t_{j}$ in document $d_{i}$:
(a): Determine a topic $z_{i j}$ ;
(b): Determine a word $t_{j}^{i}$ .

The distribution of topics in document $d_{i}$ is a Dirichlet distribution over the number of topics, $θ_{i} \sim Dirichlet_{k}(α)$, where $θ_{i} = \{θ_{iκ} \mid i = \overline{1, n} \land κ = \overline{1, k} \land \sum_{κ=1}^{k} θ_{iκ} = 1\}$ is a k-dimensional vector of probabilities, $θ_{iκ}$ is the probability of topic $κ$ occurring in document $d_{i}$, and $α = \{α_{1}, α_{2}, \dots, α_{k}\}$ is a k-dimensional vector of positive reals $α_{κ} > 0$.

The distribution of words in topic $κ$ is also a Dirichlet distribution over the vocabulary, $φ_{κ} \sim Dirichlet_{m}(β)$, where $φ_{κ} = \{φ_{κj} \mid κ = \overline{1, k} \land j = \overline{1, m} \land \sum_{j=1}^{m} φ_{κj} = 1\}$ is an m-dimensional vector of probabilities, $φ_{κj}$ is the probability of word $t_{j}$ occurring in topic $κ$, and $β = \{β_{1}, β_{2}, \dots, β_{m}\}$ is an m-dimensional vector of positive reals $β_{j} > 0$.

For each document $d_{i}$ ($i = \overline{1, n}$), we define $z_{iκ}$, described by a set of words $t_{κj}$ ($j = \overline{1, m}$) of size $m_{i}$. Both $z_{iκ}$ and $t_{κj}$ follow multinomial distributions, i.e., $z_{iκ} \sim Multinomial_{k}(θ_{i})$ and $t_{κj} \sim Multinomial_{m}(φ_{z_{iκ}})$.
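The generative process described above can be simulated directly. The sketch below samples synthetic documents (as lists of word indices) from given hyperparameters; it illustrates the model's assumptions and is not an inference procedure such as Gibbs sampling. The function name, the fixed seed, and the rule that every document has at least one word are assumptions.

```python
import numpy as np


def lda_generate(n_docs, k, m, alpha, beta, xi, seed=0):
    """Sample documents from the LDA generative process with k topics and m terms."""
    rng = np.random.default_rng(seed)
    # (2) One word distribution phi_kappa per topic, drawn from Dirichlet_m(beta).
    phi = rng.dirichlet(beta, size=k)            # shape (k, m)
    # (1) One topic distribution theta_i per document, drawn from Dirichlet_k(alpha).
    theta = rng.dirichlet(alpha, size=n_docs)    # shape (n_docs, k)
    docs = []
    for i in range(n_docs):
        # Document length m_i ~ Poisson(xi); assumption: force at least one word.
        m_i = max(1, rng.poisson(xi))
        # (3a) Draw a topic for each word position from theta_i.
        z = rng.choice(k, size=m_i, p=theta[i])
        # (3b) Draw each word index from the word distribution of its topic.
        docs.append([int(rng.choice(m, p=phi[topic])) for topic in z])
    return theta, phi, docs
```

Each row of `theta` and of `phi` is a probability vector, which is the reason the Dirichlet constraints in the definitions require the components to sum to one.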

3.5.2. Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a dimensionality reduction paradigm based on linear algebra [34]. Experimental results reported in [35] indicate that NMF is a strong choice for extracting topics. It is built on the premise that a matrix can be approximated by the product of two non-negative matrices. Thus, NMF factorizes a matrix $A \in \mathbb{R}^{n \times m}$ into two non-negative matrices $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{k \times m}$. With regard to Topic Modeling, these matrices have the following meaning:

(1): A is a document–term matrix constructed using weighted term frequencies for a corpus containing n documents and a vocabulary of size m terms;
(2): W is the document–topic matrix that assigns a document membership to each topic k;
(3): H is the topic–term matrix that assigns to each topic k the importance of a term.

To determine W and H, the objective function $F(W, H)$ must be minimized under the constraint that all the elements of W and H are non-negative. Equation (6) presents the objective function, where $\| \cdot \|_{F}$ is the Frobenius norm.

$$F(W, H) = \| A - WH \|_{F}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{m} (A_{ij} - (WH)_{ij})^{2} \tag{6}$$

To minimize the objective function (Equation (7)), the values of W and H are updated iteratively (with $τ$ the index of the iteration) until they stabilize (Equation (8)).

$$\min_{W \geq 0, H \geq 0} F(W, H) = \min_{W \geq 0, H \geq 0} \| A - WH \|_{F}^{2} \tag{7}$$

$$\begin{aligned} H_{ij}^{τ+1} &\leftarrow H_{ij}^{τ} \frac{((W^{τ})^{T} A)_{ij}}{((W^{τ})^{T} W^{τ} H^{τ})_{ij}} \\ W_{ij}^{τ+1} &\leftarrow W_{ij}^{τ} \frac{(A (H^{τ+1})^{T})_{ij}}{(W^{τ} H^{τ+1} (H^{τ+1})^{T})_{ij}} \end{aligned} \tag{8}$$
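The multiplicative updates of Equation (8) translate almost literally into NumPy. In the sketch below, the random initialization, the fixed number of iterations (instead of a convergence check), and the small constant `eps` that prevents division by zero are assumptions added for a runnable example.

```python
import numpy as np


def objective(A, W, H):
    """Squared Frobenius norm of the reconstruction error (Equation (6))."""
    return np.linalg.norm(A - W @ H, "fro") ** 2


def nmf(A, k, iterations=200, eps=1e-10, seed=0):
    """Factorize a non-negative n x m matrix A into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iterations):
        # Update H with the current W, then W with the freshly updated H.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H remain non-negative without an explicit projection step. The dominant topic of document i can then be read from the largest entry in row i of W.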

3.5.3. Latent Semantic Indexing

Latent Semantic Indexing (LSI) tries to solve the problem of synonyms by identifying terms that statistically appear together. The algorithm’s main consideration is that the randomness of word choice within documents hides an underlying latent semantic structure. To determine this latent structure, LSI employs the matrix factorization technique called Singular Value Decomposition (SVD). It identifies syntactically different but semantically similar terms using a structure called hidden “concept” space.

Given the document–term matrix A of size $n \times m$ (n is the number of documents, m is the number of terms in the vocabulary), LSI uses SVD to iteratively factorize A into a product of three matrices, i.e., $A = U Σ V^{T}$.

(1): U is an $n \times k$ matrix that denotes the document–topic association. The columns of U are the eigenvectors $u_{i}$ of $A A^{T}$, whose non-zero eigenvalues are the squared singular values $σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{k}^{2}$. Moreover, the $u_{i}$ are orthonormal vectors, i.e., $U^{T} U = I$, and are called left singular vectors because they satisfy the condition $u_{i}^{T} A = σ_{i} v_{i}^{T}$.
(2): $V^{T}$ is a $k \times m$ matrix that denotes the topic–keyword association. The columns of V are the eigenvectors $v_{i}$ of $A^{T} A$, which has the same non-zero eigenvalues $σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{k}^{2}$. Moreover, the $v_{i}$ are orthonormal vectors, i.e., $V^{T} V = I$, and are called right singular vectors because they satisfy the condition $A v_{i} = σ_{i} u_{i}$.
(3): $Σ$ is a $k \times k$ diagonal matrix which has on its diagonal the singular values $σ_{i} > 0$. Thus, this diagonal matrix is defined as $Σ = diag(σ_{1}, σ_{2}, \dots, σ_{k})$, where the values are sorted in decreasing order, i.e., $σ_{1} \geq σ_{2} \geq \dots \geq σ_{k} > 0$.
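The factorization and the properties listed above can be checked numerically with a truncated SVD. This sketch uses `numpy.linalg.svd` and keeps the k largest singular values; the function name `lsi` and the toy matrix are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np


def lsi(A, k):
    """Truncated SVD of the document-term matrix: A is approximated by U_k S_k V_k^T."""
    # full_matrices=False returns the compact factorization;
    # the singular values in s are already sorted in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]


# Toy document-term matrix of rank 2 (4 documents, 3 terms).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
U_k, S_k, Vt_k = lsi(A, k=2)
```

Here row i of `U_k` gives the association of document i with each of the k topics, and row κ of `Vt_k` gives the association of topic κ with each term. Since the toy matrix has rank 2, keeping k = 2 reconstructs it exactly; for real corpora, a smaller k yields an approximation.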

3.6. Topic Embedding Module

To encode the global context that is hidden in the latent semantic structures defined by the randomness of words, we employ a topic vector embedding Topic2Vec that encodes the keywords for the k topics extracted using one of the Topic Modeling algorithms. Topic2Vec takes the weighted average of the word embeddings WordEmb of each relevant term t belonging to the topic $z_{i}$ ($i = \overline{1, k}$), where each embedding is weighted by the probability $p (t | z_{i})$ of the term within the topic $z_{i}$. Equation (9) presents the proposed encoding, where the number of keywords considered for a topic $z_{i}$ is $n_{i}$. We build a Topic2Vec for each topic model and WordEmb we previously discussed.

\text{Topic2Vec} (z_{i}) = \frac{\sum_{t \in z_{i}} \text{WordEmb} (t) \cdot p (t | z_{i})}{n_{i}}

(9)
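Equation (9) can be sketched in a few lines. The dictionaries below are hypothetical toy data, and the sketch assumes that every topic keyword has a word embedding.

```python
def topic2vec(keyword_probs, word_emb):
    """Equation (9): sum of p(t | z) * WordEmb(t) over the keywords, divided by their number."""
    terms = list(keyword_probs)
    dim = len(word_emb[terms[0]])
    total = [0.0] * dim
    for t in terms:
        for j, value in enumerate(word_emb[t]):
            total[j] += value * keyword_probs[t]
    return [x / len(terms) for x in total]

# Hypothetical 2-dimensional embeddings and keyword probabilities for one topic.
word_emb = {"fun": [1.0, 0.0], "boring": [0.0, 1.0]}
keyword_probs = {"fun": 0.6, "boring": 0.2}
topic_vector = topic2vec(keyword_probs, word_emb)
```

The resulting vector has the same size as the word embedding, here two dimensions.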

3.7. Document-Topic Embedding Module

The document-topic embeddings DocTopic2Vec (Equation (10)) are generated by concatenating (operator ⊕) the Doc2Vec of a document with the Topic2Vec of its most dominant topic. We build a DocTopic2Vec for each Topic2Vec we previously discussed using the same WordEmb for both the Doc2Vec and the Topic2Vec. By concatenating the Doc2Vec with the Topic2Vec, the resulting DocTopic2Vec combines the local context given by the document embedding (Doc2Vec) with the global context given by the topic embedding (Topic2Vec).

\text{DocTopic2Vec} (d_{i}, z_{i}) = \text{Doc2Vec} (d_{i}) \oplus \text{Topic2Vec} (z_{i})

(10)
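A minimal sketch of Equation (10) is given below. It assumes that the dominant topic of a document is the one with the largest weight in its document–topic distribution; the vectors and weights are hypothetical.

```python
def doctopic2vec(doc_vec, topic_vecs, doc_topic_weights):
    """Concatenate the Doc2Vec with the Topic2Vec of the dominant topic."""
    dominant = max(range(len(doc_topic_weights)), key=lambda i: doc_topic_weights[i])
    return doc_vec + topic_vecs[dominant]  # list concatenation plays the role of the ⊕ operator

doc_vec = [0.1, 0.2]                       # hypothetical Doc2Vec
topic_vecs = [[1.0, 1.0], [2.0, 2.0]]      # hypothetical Topic2Vecs for two topics
doc_topic_weights = [0.3, 0.7]             # topic 1 is dominant
embedding = doctopic2vec(doc_vec, topic_vecs, doc_topic_weights)
```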

3.8. Classification Module

For classification, we use the Logistic Regression (LogReg) algorithm, which serves as a baseline, and multiple Deep Neural Network (DNN) Architectures.

3.8.1. Logistic Regression

Logistic Regression (LogReg) is a classification algorithm that is often used as a baseline for the Sentiment Analysis task to predict the class in which an observation can be categorized [36,37]. The algorithm determines the parameters that produce the best estimations by maximizing the log-likelihood of the training data, i.e., by minimizing the negative log-likelihood using gradient descent [38]. Because the negative log-likelihood is convex, the gradient descent algorithm can converge to the global minimum.
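The training principle can be sketched for the binary case as follows. This is an illustrative simplification: the experiments use a three-class LogReg model, and the learning rate, number of epochs, and toy data below are assumptions.

```python
import math

def predict_proba(x, w, b):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

def neg_log_likelihood(X, y, w, b):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean negative log-likelihood."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, w, b) - yi   # derivative w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

X = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical one-dimensional observations
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
loss_before = neg_log_likelihood(X, y, [0.0], 0.0)
loss_after = neg_log_likelihood(X, y, w, b)
```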

3.8.2. Deep Neural Network

Deep Neural Network (DNN) Architectures are used to classify the textual data and extract the polarity at the document level using Doc2Vec and DocTopic2Vec. These architectures are developed using different fully connected or convolutional layers. The neural network units that make up these layers are Perceptron, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Bidirectional GRU (BiGRU), and Bidirectional LSTM (BiLSTM). Figure 2 presents the combinations we use between these layers to create 17 DNN Architectures.

A Perceptron is a processing unit used to predict the label of an observation $\hat{y} = \arg\max_{y} f (x, y) \cdot w$. The function $f (x, y)$ maps each possible feature representation <$x, y$> pair to a new feature vector $x$, which is multiplied by a weight vector $w$. The $x$ vector must fulfill the following conditions: (1) it has a positive number of elements, and (2) the values of its elements are real numbers.

GRU is a recurrent unit that has two gating mechanisms: (1) the update gate, and (2) the reset gate. The update gate is used as both the forget gate and the input gate. The reset gate determines what percentage of the previous hidden state contributes to the candidate state of the new step. Furthermore, the GRU has only one state component, i.e., the hidden state.
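The gating mechanism can be sketched for a single scalar GRU unit. The weight names in the dictionary are hypothetical, biases are omitted, and the sketch uses the convention in which the update gate weights the candidate state; some formulations swap the roles of z and 1 − z.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU unit with hypothetical weights p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)                # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                     # new hidden state

params = {"wz": 0.5, "uz": 0.1, "wr": 1.0, "ur": 1.0, "wh": 1.0, "uh": 1.0}
h = 0.0
states = []
for x in [1.0, -2.0, 0.5, 3.0]:   # toy input sequence
    h = gru_step(x, h, params)
    states.append(h)

# With an almost closed update gate, the previous hidden state is kept.
closed = dict(params, wz=-100.0, uz=0.0)
kept = gru_step(1.0, 0.3, closed)
```

The single hidden state is a convex combination of the previous state and the candidate state, so it always stays within (−1, 1).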

LSTM is a recurrent unit that uses in its design two components to represent its state: (1) the hidden state is given by a short-term memory component, and (2) the current cell state is achieved by the long-term memory component. The LSTM unit comprises a gating mechanism with three gates and a memory cell. The gating mechanism has the following gates: (1) the input gate, (2) the forget gate, and (3) the output gate. LSTM controls the gradients’ values and avoids the problems of vanishing and exploding gradients by using the forget gate and the properties of the additive functions which compose the cell state gradients.

Bidirectional RNN (BiRNN) units allow for the use of information from both the previous and next state to make predictions about the current state. We use both BiGRU and BiLSTM in our models.

Dense layers are regular deeply connected neural network layers that contain only Perceptron units.

CNNs are Deep Neural Networks containing multiple convolutional hidden layers that apply filters to their input. After a convolutional layer, it is customary to use a layer that employs a pooling mechanism. The pooling layer reduces the dimensions of the data returned by the convolutional layer. This reduction is achieved by combining the outputs of a group of neurons from the previous layer into a single neuron. The output of this neuron is then used as the input of the following layer.

Considering these layers, we propose six new CNN-(Bi)RNN architectures: CNN-BiGRU, CNN-3GRU, CNN-3BiGRU, CNN-BiLSTM, CNN-3LSTM, and CNN-3BiLSTM. When multiple recurrent layers are used, they form a stacked architecture.

These architectures are designed as follows:

(1): The input layer that accepts the Doc2Vec or DocTopic2Vec;
(2): The CNN layer;
(3): MaxPooling;
(4): (Stacked) Recurrent layer(s), either (Bi)LSTM or (Bi)GRU;
(5): Dense layer containing Perceptron.

Moreover, we implement the DNN Architecture presented in [3]. We use the same configurations for this DNN as presented in the original work. In the experiments, we name this architecture 4CNN-BiLSTM.

3.9. Evaluation Module

Evaluation metrics are used to better understand the performance of a model and for fine-tuning the model on a given classification task. In our case, we are solving a multi-class classification problem where we are trying to determine the different polarities of a given text. Thus, we use the weighted accuracy measure for evaluating our models because it takes into account the distribution of classes within the dataset. The weighted accuracy $ω A$ (Equation (11)) measures the per-class effectiveness of a classifier by employing the True/False Positive ($T P_{i}$ and $F P_{i}$) and True/False Negative ($T N_{i}$ and $F N_{i}$) counts. Given k classes $y_{i}$ ($i = \overline{1, k}$) and a dataset with n observations where $n_{i}$ observations are labeled with class $y_{i}$, we can compute a weight $ω_{i}$ for each class $y_{i}$ using Equation (12).

ω A = \frac{1}{k} \sum_{i = 1}^{k} ω_{i} \cdot \frac{T P_{i} + T N_{i}}{T P_{i} + T N_{i} + F P_{i} + F N_{i}}

(11)

ω_{i} = \frac{k \cdot n_{i}}{n}

(12)
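Equations (11) and (12) can be sketched as follows, counting the per-class true/false positives and negatives in a one-vs-rest manner. The function name and the toy labels are assumptions made for this example.

```python
def weighted_accuracy(y_true, y_pred, classes):
    """Weighted accuracy following Equations (11) and (12)."""
    n, k = len(y_true), len(classes)
    total = 0.0
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        weight = k * sum(t == c for t in y_true) / n         # Equation (12)
        total += weight * (tp + tn) / (tp + tn + fp + fn)    # summand of Equation (11)
    return total / k

y_true = [0, 0, 1, 1, 1, 2]   # hypothetical gold polarities
y_pred = [0, 1, 1, 1, 2, 2]   # hypothetical predictions
score = weighted_accuracy(y_true, y_pred, classes=[0, 1, 2])
perfect = weighted_accuracy(y_true, y_true, classes=[0, 1, 2])
```

Because the weights $ω_{i}$ sum to k, a classifier that predicts every label correctly obtains a weighted accuracy of 1.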

4. Experimental Results

4.1. Dataset

For the experiments, we used a game reviews dataset containing context textual data posted on the MetaCritic website (https://www.metacritic.com/, accessed on 28 September 2021). The original version of this dataset is presented in [39] and improved in [40]. From the dataset, we use only the reviews and the polarity assigned for each review, although the collected raw data also contain other information. The polarity was transformed from the initial string format (i.e., positive, neutral, negative) into integer format (i.e., 2, 1, 0). This dataset contains over 90,500 game reviews with a polarity assigned that can be −1 for negative, 1 for positive, or 0 for neutral. After preprocessing, we obtained 90,165 reviews with the following class distribution: 15,721 negative, 22,433 neutral, and 52,011 positive. Out of the total number of 90,165 comments, 99.31% were in English, while 0.69% were in Spanish. As the number of comments in Spanish is negligible, we kept them to see if and how they impact our analysis. The vocabulary size is 23,016. The reviews contain from 1 to 1217 terms, with an average of 44.02. Reviews with a length between 1 and 50 words are the most common in the dataset, i.e., 66,713. The number of reviews with more than 100 words is 8538.

Experimentally, we have identified that the classification tasks perform better when the training and testing sets keep the proportions of the polarities of the entire dataset. For example, in a LogReg classification experiment, if the data are split poorly, e.g., mostly positive reviews are used in the training dataset, the accuracy is lower than 55%. If equal proportions are created, based on three-quarters of the initial dataset, the accuracy improves up to 67%, whereas if the dataset is split using the initial proportions, this results in approximately 71% accuracy. Therefore, we conducted the classification experiments using 80% of the dataset for training and 20% for testing, i.e., 72,132 reviews for training and 18,033 for testing. We preserved the polarity distribution of reviews in both the training and testing subsets. Moreover, we identified that the better the data are cleaned, i.e., the fewer misspelled or foreign words are left in the dataset, the better the accuracy of the classification tasks is, with an increase of up to 10% in accuracy compared to other data normalization methods.
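A split that preserves the class proportions can be sketched with the standard library alone. The function below is an illustrative stand-in and not the routine used in the experiments; the seed and the toy label distribution are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) that keep the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within each class
        cut = round(len(indices) * test_fraction)     # per-class test size
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test

# Toy labels: 20 negative (0), 30 neutral (1), and 50 positive (2) reviews.
labels = [0] * 20 + [1] * 30 + [2] * 50
train_idx, test_idx = stratified_split(labels)
test_distribution = Counter(labels[i] for i in test_idx)
```

Each class contributes 20% of its reviews to the testing subset, so both subsets mirror the polarity distribution of the whole dataset.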

4.2. Word Embedding

To identify the best size for each WordEmb, we tested various parameters and evaluated the resulting embeddings using a few approaches:

(1): Computing accuracy by identifying how well the model recognizes analogies; the test is performed using the questions-words dataset [41] that contains pairs of analogies from different domains;
(2): Identifying the cosine similarity between words with positive and negative connotation that appear in the dataset, i.e., (fun, enjoyable), (boring, dull), etc., and
(3): Checking the most similar words with a common word in the dataset.
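The second check relies on the cosine similarity between two word vectors, which can be computed as sketched below. The three-dimensional vectors are hypothetical and only stand in for trained word embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two words with a similar connotation and one opposite word.
fun = [0.9, 0.1, 0.3]
enjoyable = [0.8, 0.2, 0.4]
boring = [-0.7, 0.6, 0.1]

similar = cosine_similarity(fun, enjoyable)
dissimilar = cosine_similarity(fun, boring)
```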

We determined experimentally using a grid search that (1) the best window size is four; (2) the number of epochs used for training is 30; and (3) the initial learning rate is $10^{- 2}$. Table 1 presents the final embedding sizes used for classification determined after evaluation.

4.3. Document Embeddings

Using the five WordEmbs, we construct a Doc2Vec for each review as an average of the WordEmbs for the terms in the document. The size of the Doc2Vec is equal to the size of the WordEmb used.
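The averaging step can be sketched as follows. Skipping terms without an embedding and returning a zero vector for an empty document are assumptions of this sketch, and the embeddings are toy values.

```python
def doc2vec(tokens, word_emb, dim):
    """Average the word embeddings of the terms in a document."""
    vectors = [word_emb[t] for t in tokens if t in word_emb]
    if not vectors:
        return [0.0] * dim   # assumption: zero vector when no term is known
    return [sum(column) / len(vectors) for column in zip(*vectors)]

word_emb = {"fun": [1.0, 0.0], "game": [0.0, 1.0]}   # hypothetical 2-dimensional WordEmb
review_vector = doc2vec(["fun", "game", "unknownword"], word_emb, dim=2)
empty_vector = doc2vec([], word_emb, dim=2)
```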

4.4. Topic Modeling

We identify 10 topics using the TFIDF document–term matrix as input together with the three Topic Modeling algorithms, i.e., LDA, NMF, and LSI. From each topic, the 15 most relevant features are used in the algorithm for computing the topic embeddings. The number of documents where a topic is the most relevant is presented in Table 2. Table 3, Table 4 and Table 5 present the results for LDA, NMF, and LSI, respectively.

Analyzing the results of Table 3, we observe that LDA extracts diverse topics that can be interpreted using the keywords, e.g., Topic 0 is related to racing games, Topic 4 is related to sports games. Furthermore, LDA also manages to determine topics that find hidden latent semantic patterns that describe polarity, e.g., Topic 9 and Topic 7. We also note that LDA manages to detect and group together documents that have words in languages other than English, e.g., Topic 3, being the only algorithm among the three used in our analysis that picked up on this negligible percentage (0.69%) of comments.

As in the case of LDA, NMF (Table 4) manages to determine topics related to different game genres and polarity. Unlike LDA, NMF manages to discover topics that group together both polarity and game type, e.g., Topic 0, Topic 1, Topic 2. Furthermore, NMF fails to discover the comments that use a language other than English.

Finally, LSI (Table 5) manages to determine topics related to the overall game play experience and users’ opinion towards this aspect, e.g., Topic 0, Topic 2, Topic 5, Topic 6. Thus, most of the topics detected by LSI contain similar terms, e.g., game, play, to underline some of the polarity, e.g., good, awesome, beautiful, bad, fun, terrible.

4.5. Topic to Vector

For each topic determined by an algorithm, we build a Topic2Vec as the weighted average of the WordEmb and the importance of each relevant word that describes the topic. Thus, the size of the Topic2Vec is the same as the size of the used WordEmb.

4.6. Document-Topic to Vector

A DocTopic2Vec is created by concatenating the Doc2Vec with the Topic2Vec of the dominant topic for a document. The same WordEmb is used when constructing the Doc2Vec and Topic2Vec embeddings that are concatenated for building the DocTopic2Vec embedding. Thus, the size of the DocTopic2Vec is twice the size of the used WordEmb.

4.7. Classification Algorithms

The classification experiments with LogReg are computed using both Doc2Vec and DocTopic2Vec. For this model to achieve a stronger regularization, we set the inverse regularization parameter C to $10^{- 5}$.

Using the GRU units, we built multiple models:

(1): One with a single GRU Layer;
(2): One with three GRU layers (3GRU);
(3): One with a single BiGRU Layer, and
(4): One with three BiGRU layers (3BiGRU).

All these models have a final Dense Layer used for the final classification. Each GRU layer is initialized with 128 units and a dropout of 0.2. The activation for the update gate is the sigmoid function (Equation (13)) and for the reset gate the hyperbolic tangent function (Equation (14)). The sigmoid function takes values in $(0, 1)$ and is used for models that utilize the probability of a variable. The hyperbolic tangent function takes values in the $(- 1, 1)$ interval and is mainly used to better differentiate between the strongly negative values and 0. The Dense output layer is initialized using the softmax activation function (Equation (15)) and with three as the dimension, corresponding to the number of possible values for the polarity. For multiclass classification, the softmax function is a generalized logistic activation function used to normalize the output of a network $x = (x_{1}, x_{2}, \dots, x_{K})$ to a probability distribution over the predicted output classes $i = \overline{1, K}$. In our case, we set $K = 3$, as we are predicting the positive, negative, or neutral polarity.

\mathrm{sigmoid} (x) = \frac{1}{1 + e^{- x}}

(13)

\tanh (x) = \frac{e^{2 x} - 1}{e^{2 x} + 1}

(14)

\mathrm{softmax} {(x)}_{i} = \frac{e^{x_{i}}}{\sum_{j = 1}^{K} e^{x_{j}}}

(15)
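The three activation functions from Equations (13)–(15) can be written directly in code. Subtracting the maximum before exponentiation in the softmax is an addition of this sketch for numerical stability; it does not change the result. The network output used in the example is hypothetical.

```python
import math

def sigmoid(x):
    """Equation (13)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Equation (14)."""
    return (math.exp(2 * x) - 1.0) / (math.exp(2 * x) + 1.0)

def softmax(xs):
    """Equation (15), with the maximum subtracted for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical network output for K = 3 polarity classes.
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```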

Using the LSTM units, we built multiple models to mirror the GRU architectures:

(1): One with a single LSTM Layer;
(2): One with three LSTM layers (3LSTM);
(3): One with a single BiLSTM Layer, and
(4): One with three BiLSTM layers (3BiLSTM).

We keep the same initialization parameters for the LSTM and Dense layers as for the GRU architectures. The activation for the input, output, and forget gates is the sigmoid function, while for the hidden state and the cell input activation vector it is the hyperbolic tangent function. The LSTM models use the same loss function and optimizer parameters.

As CNN architectures proved to be an asset for text classification [10], we build a CNN Sentiment Analysis architecture with three layers: CNN, MaxPooling, and Dense. We initialize the filters to 64 and the kernel size to half the size of the input vector, i.e., Doc2Vec or DocTopic2Vec. We also add the CNN and MaxPooling layers on top of the four GRU and four LSTM architectures to determine if convolutions on top of recurrent layers improve the classification as in [9]. Moreover, we implement the Deep Learning Architecture presented in [3] using the same configuration. In the experiments, we name this architecture 4CNN-BiLSTM.
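To make the layer dimensions concrete, the sketch below computes the output shapes of the convolutional and pooling layers for a given input size. It assumes a one-dimensional convolution with stride 1 and no padding, followed by non-overlapping pooling with a window of 2; these settings, as well as the 200-dimensional input, are assumptions of the example and are not stated above.

```python
def cnn_output_shapes(input_dim, filters=64, pool_size=2):
    """Shapes after the CNN and MaxPooling layers for a vector of size input_dim."""
    kernel_size = input_dim // 2               # kernel size is half the input vector
    conv_len = input_dim - kernel_size + 1     # assumption: stride 1, no padding
    pooled_len = conv_len // pool_size         # assumption: non-overlapping pooling
    return {
        "kernel_size": kernel_size,
        "conv": (conv_len, filters),
        "pool": (pooled_len, filters),
    }

# Hypothetical DocTopic2Vec built from a 100-dimensional WordEmb (size 2 * 100).
shapes = cnn_output_shapes(200)
```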

For all the Deep Neural Network Architectures, we utilize a batch size of 5000 to accurately estimate the error gradient, at the cost of slowing the convergence of the learning process. The loss is computed using categorical cross entropy and the applied optimizer is Adam. Each network is trained for a maximum of 200 epochs, using an automated stopping mechanism that stops the execution if the accuracy does not improve during 20 successive epochs.
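The stopping rule can be sketched independently of any Deep Learning framework. The function below receives a list of per-epoch accuracies, which are simulated here, and returns the epoch at which training would stop; whether the monitored accuracy is computed on the training or on a validation subset is left open in this sketch.

```python
def early_stopping_epoch(accuracy_per_epoch, max_epochs=200, patience=20):
    """Return (last_epoch, best_accuracy) under the stopping rule."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(accuracy_per_epoch[:max_epochs], start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch    # improvement: remember it
        elif epoch - best_epoch >= patience:
            return epoch, best_acc               # no improvement for `patience` epochs
    return min(len(accuracy_per_epoch), max_epochs), best_acc

# Simulated run: the accuracy peaks at epoch 2 and never improves afterwards.
plateau = [0.5, 0.6] + [0.55] * 100
stopped_at, best = early_stopping_epoch(plateau)

# Simulated run: the accuracy keeps improving, so the epoch limit is reached.
improving = [i / 1000 for i in range(300)]
limit_reached = early_stopping_epoch(improving)
```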

4.8. Implementation

The entire pipeline is implemented in Python 3.7. For named entity recognition and lemmatization, we used the en_core_web_sm model from the SpaCy [42] package. We use the gensim [43] and python-glove [44] packages for the WordEmbs and the scikit-learn [45] package for the TFIDF vectorization, LogReg classifier, and Topic Modeling algorithms. All the DNN Architectures are implemented in Keras [46] with TensorFlow [47] as the tensor backend engine. The experiments are run on an NVIDIA® DGX Station™. The code is freely available online on GitHub at https://github.com/cipriantruica/DocTopic2Vec.

4.9. Results

Table 6, Table 7, Table 8, Table 9 and Table 10 present the average accuracy obtained after 10 distinct training experiments. As a baseline for the embeddings, we use the Doc2Vec, while, for classification, we use LogReg. We utilize Stratified Cross-Validation for splitting the dataset into 80–20% training–testing sets with random seeding, i.e., 72,132 reviews for training and 18,033 for testing. Furthermore, we identified that the better the data are cleaned, i.e., the fewer misspelled or foreign words are left in the dataset, the better the accuracy of the classification tasks is, with an increase of up to 10% in accuracy compared to other data normalization methods.

The proposed DocTopic2Vec significantly improves the detection of polarity at the document level, for both NMF and LSI, by over 5% compared to the simple implementation of Doc2Vec. In the case of LDA, we observe a decrease in accuracy. When using the Word2Vec CBOW model to construct DocTopic2Vec (Table 6), we obtain the overall best accuracy for all our experiments (i.e., 0.7718) with the GRU architecture and the LSI topic model. For the Word2Vec Skip-Gram (Table 7), the CNN-BiGRU architecture with the DocTopic2Vec for NMF obtains the best results. The CNN-BiGRU architecture also achieves the best results when building the DocTopic2Vec using FastText and LSI (Table 8 and Table 9). When using GloVe and the LSI topic model to construct the DocTopic2Vec (Table 10), the best results are obtained with the novel CNN-3BiGRU architecture.

The experimental results show that the polarity detection accuracy is improved if the Topic Modeling algorithms meet at least one of the following two conditions:

(1): The document to dominant topic distribution is balanced and manages to group context-related documents together;
(2): The importance of the terms that belong to the topic has a small value range in order to enhance the document vectorization with the context-dependent terms.

Thus, depending on the used Topic Modeling algorithm, the overall performance of the proposed model changes.

LSI manages to meet the first condition needed to improve the accuracy of the polarity detection task. The importance of the relevant keywords for a topic detected with LSI has values ≤1. These values influence the Topic2Vec values (Table 5). Thus, the final DocTopic2Vec’s values remain balanced for the entire encoding, and the context extracted through Topic Modeling in conjunction with the distribution of document to dominant topic improves the classification task (Table 2).

NMF satisfies both conditions needed to improve the accuracy of the Sentiment Analysis task. For NMF, the importance of the relevant keywords is not normalized and has values in the range $[0, 6.64]$, but the majority of the values are still ≤1 (Table 4). When building the Topic2Vec for NMF, some dimensions are going to have higher values which add more importance to the context-related words. During the training of the model, the higher values introduce bias to these dimensions in the classification task and manage to influence the accuracy of detecting the document-level polarity by better grouping documents together. Moreover, the more balanced distribution of documents to the dominant topic obtained by the NMF (Table 2) also influences the context-based grouping of documents.

LDA does not meet either of the two conditions needed for an improved polarity detection model; thus, the accuracy decreases. When using LDA to build the DocTopic2Vec, the LogReg and RNN results are influenced by the importance of a word to a topic and the distribution of the document to the dominant topic. The relevant words for some topics have high importance (Table 3). Thus, the Topic2Vec values are larger than the Doc2Vec values. Because the distribution of document to dominant topic is not balanced (Table 2), the Topic2Vec with the highest values is assigned to the majority of documents. When building the DocTopic2Vec by concatenation, the second half of the embedding and the imbalanced distribution of the document to dominant topic influence the classification task, and the results are similar to flipping a coin.

For the CNN models with bidirectional RNNs, we obtain better results for the DocTopic2Vec constructed with LDA, as the Deep Neural Network uses convolutions to select values. Therefore, the impact of the second half of the embedding, i.e., the Topic2Vec values, as well as that of the imbalanced document grouping, is minimized, and for some tests, we obtain better results.

We observe that, on average, we obtain better results when using bidirectional models for both the fully connected (e.g., BiGRU, 3BiGRU, etc.) and convolutional architectures (e.g., CNN-BiGRU, CNN-3BiGRU, etc.). We note that, on average, the proposed new architectures perform better for this task. Furthermore, our models outperform the state-of-the-art 4CNN-BiLSTM architecture. Stacking multiple layers of RNNs (e.g., 3GRU, 3BiGRU, etc.) with or without using a CNN brings very little improvement in accuracy over the architectures with a single layer. When the stacked architectures are better, they only bring a ∼$1 \%$ improvement. The same observation can be made for the architecture that stacks multiple CNNs, i.e., 4CNN-BiLSTM.

As a final remark, we compare our results with the results obtained on the same dataset in [48] and in [40]. Our proposed Deep Neural Network architectures outperform by ∼$10 \%$ the Transformer-based models in [40] that obtained an accuracy of only 0.67.

5. Conclusions

In this paper, we propose DocTopic2Vec, a novel embedding that incorporates contextual cues through the use of Topic Modeling. We use a dataset with game reviews to learn different WordEmb models, i.e., Word2Vec, FastText, and GloVe. Applying the different WordEmbs, we create Doc2Vecs for each review and Topic2Vecs for each topic extracted by LDA, NMF, and LSI. A DocTopic2Vec is constructed for each review as the concatenation of its Doc2Vec with the Topic2Vec for its dominant topic. Both the Doc2Vec and the Topic2Vec use the same WordEmb when they are concatenated into the DocTopic2Vec. To prove the efficiency of the newly proposed DocTopic2Vec in the task of Document-Level Sentiment Analysis, we implement different Deep Neural Network (DNN) Architectures using combinations of fully connected (i.e., GRU, LSTM, BiGRU, BiLSTM, Dense) and convolutional (CNN) layers. Furthermore, we propose six novel Convolutional-based Recurrent DNN Architectures that outperform the state-of-the-art 4CNN-BiLSTM architecture [3].

The experimental results show an improvement of ∼$5 \%$ in the accuracy of determining the document-level polarity when employing the newly proposed context-enhanced DocTopic2Vec for the NMF- and LSI-based topic embeddings over the baseline, i.e., Doc2Vec with LogReg. These embeddings manage to improve the classification by:

(1): Grouping context-related documents together through the document to dominant topic distribution;
(2): Enhancing the document vectorization with the importance of context-dependent terms that belong to the topic.

Furthermore, we observe that if the Topic Modeling algorithm does not meet these requirements, the polarity detection accuracy drops significantly, as in the case of LDA. Finally, we want to note that our proposed CNN-(Bi)RNN architectures outperform by ∼$10 \%$ the best-performing state-of-the-art model applied on the same dataset in [40].

By combining Topic Modeling with the Sentiment Analysis task and by obtaining better results, we manage to answer $(Q_{1})$ and to fulfill objective $(O_{1})$. We answer $(Q_{2})$ by adding local and global context through the novel DocTopic2Vec embedding and improving the accuracy of detecting the polarity of textual data, thus achieving objective $(O_{2})$. By introducing novel CNN-(Bi)RNN Deep Learning Architectures that improve the accuracy of the Sentiment Analysis task, we answer our final research question $(Q_{3})$ and complete objective $(O_{3})$.

As future work, we aim to test other embeddings, e.g., Mittens [49], which learns domain-specific representations, MOE [50], which manages word misspellings, and BERT [22], which considers the word’s occurrence and position when computing its context. Furthermore, we plan to explore how the WordEmbs used in this paper could be used with other neural networks, such as Hierarchical Attention Networks or Deep Belief Networks.

Author Contributions

Conceptualization, C.-O.T., E.-S.A. and M.-L.Ș.; methodology, C.-O.T., E.-S.A. and M.-L.Ș.; software, C.-O.T., E.-S.A. and M.-L.Ș.; validation, C.-O.T., E.-S.A. and M.-L.Ș.; formal analysis, C.-O.T., E.-S.A., M.-L.Ș. and A.P.; investigation, C.-O.T., E.-S.A., M.-L.Ș. and A.P.; resources, C.-O.T. and E.-S.A.; data curation, E.-S.A. and M.-L.Ș.; writing—original draft preparation, C.-O.T., E.-S.A., M.-L.Ș. and A.P.; writing—review and editing, C.-O.T., E.-S.A. and A.P.; visualization, C.-O.T. and E.-S.A.; supervision, C.-O.T. and E.-S.A.; project administration, C.-O.T., E.-S.A. and A.P.; funding acquisition, A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The research presented in this paper was supported in part by the German Federal Ministry of Education and Research (BMBF) through the project QURATOR (Grant No. 03WKDA1F) and PANQURA (Grant No. 03COV03F), and the German Academic Exchange Service (DAAD) through the projects “Deep-Learning Anomaly Detection for Human and Automated Users Behavior” (Grant No. 91809358) and “AWAKEN: content-Aware and netWork-Aware faKE News mitigation” (Grant No. 91809005).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

WordEmb	Word Embedding
Word2Vec	Word to Vector
GloVe	Global Vectors
Doc2Vec	Document to Vector
CBOW	Continuous Bag-of-Words
Topic2Vec	Topic to Vector
DocTopic2Vec	Document-Topic to Vector
DNN	Deep Neural Networks
RNN	Recurrent Neural Networks
BiRNN	Bidirectional Recurrent Neural Networks
GRU	Gated Recurrent Unit
BiGRU	Bidirectional Gated Recurrent Unit
LSTM	Long Short-Term Memory
BiLSTM	Bidirectional Long Short-Term Memory
CNN	Convolutional Neural Networks
LDA	Latent Dirichlet Allocation
LSI	Latent Semantic Indexing
NMF	Non-Negative Matrix Factorization
LogReg	Logistic Regression
TF	Term Frequency
IDF	Inverse Document Frequency
TFIDF	Term Frequency–Inverse Document Frequency
ABCDM	Attention-based Bidirectional CNN-RNN Deep Model
TP	True Positive
FP	False Positive
TN	True Negative
FN	False Negative

References

Naseem, U.; Razzak, I.; Musial, K.; Imran, M. Transformer based Deep Intelligent Contextual Embedding for Twitter sentiment analysis. Future Gener. Comput. Syst. 2020, 113, 58–69. [Google Scholar] [CrossRef]
Rezaeinia, S.M.; Rahmani, R.; Ghodsi, A.; Veisi, H. Sentiment analysis based on improved pre-trained word embeddings. Expert Syst. Appl. 2019, 117, 139–147. [Google Scholar] [CrossRef]
Rhanoui, M.; Mikram, M.; Yousfi, S.; Barzali, S. A CNN-BiLSTM Model for Document-Level Sentiment Analysis. Mach. Learn. Knowl. Extr. 2019, 1, 832–847. [Google Scholar] [CrossRef] [Green Version]
Yusof, N.N.; Mohamed, A.; Abdul-Rahman, S. Context Enrichment Model Based Framework for Sentiment Analysis. In International Conference on Soft Computing in Data Science; Springer: Singapore, 2019; pp. 325–335. [Google Scholar] [CrossRef]
Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. 2020, 53, 4335–4385. [Google Scholar] [CrossRef]
Vijayaragavan, P.; Ponnusamy, R.; Aramudhan, M. An optimal support vector machine based classification model for sentimental analysis of online product reviews. Future Gener. Comput. Syst. 2020, 111, 234–240. [Google Scholar] [CrossRef]
Nemes, L.; Kiss, A. Social media sentiment analysis based on COVID-19. J. Inf. Telecommun. 2020, 5, 1–15. [Google Scholar] [CrossRef]
Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010; pp. 1045–1048. [Google Scholar]
Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar] [CrossRef] [Green Version]
Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional Recurrent Neural Networks for Text Classification. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; pp. 1–6. [Google Scholar] [CrossRef]
Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
Arora, S.; Ge, R.; Moitra, A. Learning Topic Models—Going beyond SVD. In Proceedings of the Annual Symposium on Foundations of Computer Science, New Brunswick, NJ, USA, 20–23 October 2012; pp. 1–10. [Google Scholar] [CrossRef] [Green Version]
Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
D’Andrea, A.; Ferri, F.; Grifoni, P.; Guzzo, T. Approaches, tools and applications for sentiment analysis implementation. Int. J. Comput. Appl. 2015, 125, 26–33. [Google Scholar] [CrossRef]
Aziz, M.N.; Firmanto, A.; Fajrin, A.M.; Ginardi, R.H. Sentiment Analysis and Topic Modelling for Identification of Government Service Satisfaction. In Proceedings of the International Conference on Information Technology, Computer, and Electrical Engineering, Semarang, Indonesia, 27–28 September 2018; pp. 125–130. [Google Scholar] [CrossRef]
Yoon, H.G.; Kim, H.; Kim, C.O.; Song, M. Opinion polarity detection in Twitter data combining shrinkage regression and topic modeling. J. Inf. 2016, 10, 634–644. [Google Scholar] [CrossRef]
Usama, M.; Ahmad, B.; Song, E.; Hossain, M.S.; Alrashoud, M.; Muhammad, G. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Gener. Comput. Syst. 2020, 113, 571–578. [Google Scholar] [CrossRef]
Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharya, U.R. ABCDM: An Attention-based Bidirectional CNN-RNN Deep Model for sentiment analysis. Future Gener. Comput. Syst. 2021, 115, 279–294. [Google Scholar] [CrossRef]
García-Pablos, A.; Cuadros, M.; Rigau, G. W2VLDA: Almost unsupervised system for Aspect Based Sentiment Analysis. Expert Syst. Appl. 2018, 91, 127–137. [Google Scholar] [CrossRef] [Green Version]
Al-Janabi, O.M.; Malim, N.H.A.H.; Cheah, Y.N. Aspect Categorization Using Domain-Trained Word Embedding and Topic Modelling. In Advances in Electronics Engineering; Springer: Singapore, 2020; pp. 191–198. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:cs.CL/1907.11692.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the International Conference on Learning Representations, Virtual Conference, 26 April–1 May 2020; pp. 1–17.
Biswas, E.; Karabulut, M.E.; Pollock, L.; Vijay-Shanker, K. Achieving Reliable Sentiment Analysis in the Software Engineering Domain using BERT. In Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), Adelaide, Australia, 28 September–2 October 2020; pp. 162–173.
Zhao, L.; Li, L.; Zheng, X.; Zhang, J. A BERT based Sentiment Analysis and Key Entity Detection Approach for Online Financial Texts. In Proceedings of the 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China, 5–7 May 2021; pp. 1233–1238.
Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1, pp. 2227–2237.
Alasadi, S.A.; Bhaya, W.S. Review of data preprocessing techniques in data mining. J. Eng. Appl. Sci. 2017, 12, 4102–4107.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013; pp. 1–12.
Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; Joulin, A. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018; pp. 52–55.
Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
Niu, L.; Dai, X.; Zhang, J.; Chen, J. Topic2Vec: Learning distributed representations of topics. In Proceedings of the International Conference on Asian Language Processing, Suzhou, China, 24–25 October 2015; pp. 193–196.
Wang, Y.X.; Zhang, Y.J. Nonnegative matrix factorization: A comprehensive review. IEEE Trans. Knowl. Data Eng. 2012, 25, 1336–1353.
Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowl.-Based Syst. 2019, 163, 1–13.
Petrescu, A.; Truica, C.O.; Apostol, E.S. Sentiment Analysis of Events in Social Media. In Proceedings of the IEEE International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, 5–7 September 2019; pp. 143–149.
Mitroi, M.; Truică, C.O.; Apostol, E.S.; Florea, A.M. Sentiment Analysis using Topic-Document Embeddings. In Proceedings of the 2020 IEEE 16th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 3–5 September 2020; pp. 75–82.
Yi, D.; Ji, S.; Bu, S. An Enhanced Optimization Scheme Based on Gradient Descent Methods for Machine Learning. Symmetry 2019, 11, 942.
Secui, A.; Sirbu, M.D.; Dascalu, M.; Crossley, S.; Ruseti, S.; Trausan-Matu, S. Expressing Sentiments in Game Reviews. In Proceedings of the 17th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2016), Varna, Bulgaria, 7–10 September 2016; pp. 352–355.
Ruseti, S.; Sirbu, M.D.; Calin, M.A.; Dascalu, M.; Trausan-Matu, S.; Militaru, G. Comprehensive Exploration of Game Reviews Extraction and Opinion Mining Using NLP Techniques. In Advances in Intelligent Systems and Computing; Springer: Singapore, 2019; pp. 323–331.
Linzen, T. Issues in evaluating semantic spaces using word analogies. In Proceedings of the Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, 7–12 August 2016; pp. 13–18.
Honnibal, M.; Montani, I. spaCy 3: Industrial-Strength Natural Language Processing. 2020. Available online: https://spacy.io/ (accessed on 13 October 2021).
Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50.
Kula, M. glove-python. 2020. Available online: https://github.com/maciejkula/glove-python (accessed on 13 October 2021).
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 8 October 2021).
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 13 October 2021).
Sirbu, D.; Secui, A.; Dascalu, M.; Crossley, S.A.; Ruseti, S.; Trausan-Matu, S. Extracting Gamers’ Opinions from Reviews. In Proceedings of the 2016 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 24–27 September 2016.
Dingwall, N.; Potts, C. Mittens: An Extension of GloVe for Learning Domain-Specialized Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018; pp. 212–217.
Piktus, A.; Edizel, N.B.; Bojanowski, P.; Grave, E.; Ferreira, R.; Silvestri, F. Misspelling Oblivious Word Embeddings. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 3226–3234.

Figure 1. The proposed architecture for topic-based Sentiment Analysis using contextual cues.

Figure 2. DNN architecture.

Table 1. Embedding sizes.

Embedding	Model	Word embedding size	Doc2Vec size	DocTopic2Vec size
Word2Vec	CBOW	256	256	512
Word2Vec	Skip-Gram	128	128	256
FastText	CBOW	256	256	512
FastText	Skip-Gram	128	128	256
GloVe	N/A	128	128	256
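
The sizes in Table 1 follow from building each DocTopic2Vec by concatenating a Doc2Vec with the Topic2Vec created with the same word embedding, which doubles the vector length. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the function name `doc_topic2vec`, the length check, and the random placeholder vectors are assumptions made for illustration only.

```python
import random

# Word-embedding sizes copied from Table 1.
EMBEDDING_SIZES = {
    ("Word2Vec", "CBOW"): 256,
    ("Word2Vec", "Skip-Gram"): 128,
    ("FastText", "CBOW"): 256,
    ("FastText", "Skip-Gram"): 128,
    ("GloVe", "N/A"): 128,
}


def doc_topic2vec(doc_vec, topic_vec):
    """Concatenate a document vector with the vector of its dominant topic.

    Hypothetical helper: the equal-length check is an assumption that mirrors
    Table 1, where Doc2Vec and Topic2Vec share the word-embedding size.
    """
    if len(doc_vec) != len(topic_vec):
        raise ValueError("both vectors must be built from the same word embedding")
    return list(doc_vec) + list(topic_vec)


rng = random.Random(0)
doc_topic_sizes = {}
for key, size in EMBEDDING_SIZES.items():
    # Random placeholders stand in for real Doc2Vec and Topic2Vec vectors.
    doc_vec = [rng.gauss(0.0, 1.0) for _ in range(size)]
    topic_vec = [rng.gauss(0.0, 1.0) for _ in range(size)]
    doc_topic_sizes[key] = len(doc_topic2vec(doc_vec, topic_vec))

for (embedding, model), size in doc_topic_sizes.items():
    print(f"{embedding} ({model}): DocTopic2Vec size {size}")
```

Under these assumptions, the computed lengths reproduce the last column of Table 1 (512 for the CBOW models and 256 for the Skip-Gram and GloVe models).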

Table 2. Number of documents by dominant topic.

LDA Topic ID	LDA Documents	NMF Topic ID	NMF Documents	LSI Topic ID	LSI Documents
7	83,743	0	28,957	0	81,260
9	5297	3	14,490	2	2037
3	620	1	10,400	1	1664
4	135	2	7356	9	1660
5	108	5	7253	8	1200
0	73	4	6458	7	947
2	64	7	5864	4	788
8	53	8	3643	6	369
1	38	9	2890	5	193
6	34	6	2854	3	47
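
To make the imbalance in Table 2 easier to read, the counts can be converted into the share of documents assigned to the largest dominant topic of each model. The sketch below only performs arithmetic on the values copied from Table 2; the names `DOMINANT_TOPIC_COUNTS` and `largest_topic_share` are illustrative and do not come from the article.

```python
# Documents per dominant topic, copied from Table 2 (topic ID -> documents).
DOMINANT_TOPIC_COUNTS = {
    "LDA": {7: 83743, 9: 5297, 3: 620, 4: 135, 5: 108,
            0: 73, 2: 64, 8: 53, 1: 38, 6: 34},
    "NMF": {0: 28957, 3: 14490, 1: 10400, 2: 7356, 5: 7253,
            4: 6458, 7: 5864, 8: 3643, 9: 2890, 6: 2854},
    "LSI": {0: 81260, 2: 2037, 1: 1664, 9: 1660, 8: 1200,
            7: 947, 4: 788, 6: 369, 5: 193, 3: 47},
}


def largest_topic_share(counts):
    """Return (topic ID, total documents, share held by the largest topic)."""
    total = sum(counts.values())
    topic_id, documents = max(counts.items(), key=lambda item: item[1])
    return topic_id, total, documents / total


summary = {
    model: largest_topic_share(counts)
    for model, counts in DOMINANT_TOPIC_COUNTS.items()
}

for model, (topic_id, total, share) in summary.items():
    print(f"{model}: topic {topic_id} holds {share:.1%} of {total} documents")
```

Each column of Table 2 sums to 90,165 documents. The largest topic holds roughly 92.9% of them for LDA and 90.1% for LSI, but only about 32.1% for NMF, so NMF spreads the documents far more evenly across topics.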

Table 3. The most relevant features for topics generated with LDA.

Keywords and Impact	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9
0	racing	rule	dinosaur	para	rock	splinter	competent	game	sweet	game
0	68.54	27.97	41.72	102.85	92.42	44.41	25.88	8390.99	23.92	1112.34
1	rally	guitar	worm	con	football	cent	cough	not	piranha	good
1	60.52	25.26	36.27	82.42	86.12	42.98	24.28	5522.88	19.7	896.02
2	uplay	medal	twitch	mas	ace	cell	bout	play	harry	great
2	43.88	19.2	34.66	63.26	51.38	42.15	15.71	3433.44	19.4	560.82
3	dirt	ego	bastion	no	manager	strike	pet	good	nan	play
3	40.75	18.57	31.99	62.47	42.61	37.32	15.58	2937.24	18.17	439.94
4	coaster	fez	myst	las	soccer	counter	nonstop	great	breach	awesome
4	34.61	16.11	30.74	55.34	40.53	36.7	14.02	2137.95	16.36	342.82
5	blah	honor	noir	son	innovate	addict	demigod	time	potter	love
5	33.72	16.07	27.25	40.9	23.84	31.01	13.24	2065.73	16.07	300.87
6	processor	legal	preview	ser	mesa	spore	splendid	fun	napoleon	amazing
6	27.44	15.57	24.86	39.13	18.49	27.85	13.12	2010.35	11.27	273.44
7	terror	outstanding	jest	bien	sherlock	annihilation	remote	no	ruler	graphic
7	26.92	15.36	23.08	34.18	18.1	26.24	11.94	1876.78	10.11	261.46
8	golf	wizardry	van	bom	bastard	outlast	fluidity	story	refreshingly	fun
8	26.56	13.61	22.56	24.11	17.46	26.19	9.61	1857.25	9.72	234.68
9	roller	article	enthral	sin	submit	halo	closet	graphic	orientate	not
9	24.31	12.29	19.52	22.63	16.62	24.45	9.57	1788	9.71	226.23
10	fallen	wine	hall	dos	doctor	dogs	dying	bad	hoi	cool
10	23.9	11.57	19.46	22.61	15.63	23.59	9.37	1623.48	7.94	149.77
11	theft	shameless	working	hay	cos	ing	nook	feel	sensational	excellent
11	22.72	10.94	19.26	21.38	14.38	21.2	9.09	1470.19	7.24	144.28
12	bye	apologize	tycoon	mal	overture	suck	bomb	thing	bean	buy
12	20.01	9.060	18.17	21.33	13.55	21.09	8.8	1359.17	6.32	138.63
13	grand	sailing	pour	excelente	thumb	gra	transformer	player	crapy	year
13	18.81	8.43	17.89	20.64	13.27	20.33	8.61	1346.69	5.26	127.48
14	car	tribunal	braid	vale	cheat	sports	man	buy	fore	perfect
14	18.47	8.20	17.5	19.7	11.4	20.13	8.57	1328.62	5.08	123.02

Table 4. The most relevant features for topics generated with NMF.

Keywords and Impact	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9
0	game	good	great	story	play	not	amazing	fun	love	awesome
0	6.64	4.91	4.94	2.05	5.24	5.98	4.86	4.33	4.46	4.15
1	time	game	game	character	hour	bad	story	lot	game	graphic
1	0.57	0.79	0.75	1.63	0.60	2.01	0.53	0.74	0.29	1.10
2	no	graphic	graphic	feel	time	buy	graphic	friend	series	nice
2	0.54	0.39	0.38	1.25	0.57	1.37	0.52	0.33	0.14	0.24
3	player	pretty	story	level	year	money	game	hour	hate	cool
3	0.39	0.13	0.15	1.23	0.52	0.83	0.52	0.26	0.12	0.20
4	release	story	music	time	friend	people	simply	pretty	fan	game
4	0.38	0.12	0.14	1.07	0.42	0.67	0.31	0.26	0.12	0.19
5	people	series	atmosphere	combat	never	graphic	absolutely	worth	original	sound
5	0.38	0.09	0.13	1.06	0.28	0.56	0.25	0.24	0.11	0.18
6	strategy	nice	sound	no	player	review	episode	mode	absolutely	music
6	0.38	0.09	0.11	0.96	0.27	0.50	0.23	0.22	0.10	0.15
7	bug	shooter	job	enemy	free	worth	perfect	price	buy	car
7	0.35	0.09	0.09	0.91	0.26	0.49	0.21	0.19	0.09	0.13
8	year	year	fantastic	not	day	bug	music	bit	favorite	perfect
8	0.33	0.08	0.09	0.84	0.25	0.47	0.20	0.18	0.07	0.11
9	developer	sound	excellent	system	stop	pay	beautiful	nice	expansion	realistic
9	0.31	0.08	0.08	0.83	0.24	0.45	0.19	0.17	0.07	0.09
10	review	racing	recommend	puzzle	enjoy	thing	buy	recommend	hope	excellent
10	0.3	0.07	0.07	0.80	0.21	0.44	0.16	0.17	0.07	0.09
11	fan	perfect	action	interesting	long	no	fantastic	campaign	cool	story
11	0.29	0.06	0.07	0.78	0.19	0.44	0.15	0.16	0.07	0.08
12	never	music	expansion	world	game	waste	incredible	level	story	totally
12	0.28	0.06	0.06	0.76	0.17	0.44	0.14	0.16	0.07	0.07
13	work	world	lot	thing	start	work	sound	simple	perfect	racing
13	0.27	0.06	0.06	0.73	0.16	0.38	0.13	0.15	0.07	0.06
14	enjoy	atmosphere	puzzle	bit	single	release	atmosphere	challenge	fall	effect
14	0.26	0.05	0.06	0.73	0.16	0.38	0.13	0.15	0.06	0.06

Table 5. The most relevant features for topics generated with LSI.

Keywords and Impact	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9
0	game	good	great	game	play	play	love	fun	love	awesome
0	0.58	0.77	0.79	0.53	0.57	0.39	0.47	0.74	0.72	0.64
1	not	game	game	play	fun	not	not	not	awesome	graphic
1	0.34	0.23	0.19	0.34	0.41	0.36	0.35	0.24	0.19	0.44
2	good	great	love	love	time	great	play	graphic	player	bad
2	0.24	0.15	0.11	0.11	0.12	0.33	0.3	0.12	0.15	0.27
3	play	graphic	fun	buy	hour	buy	story	awesome	good	amazing
3	0.24	0.05	0.09	0.08	0.12	0.16	0.27	0.11	0.1	0.19
4	great	awesome	awesome	year	level	bad	amazing	lot	map	no
4	0.16	0.05	0.08	0.06	0.11	0.13	0.17	0.09	0.1	0.15
5	time	play	amazing	friend	lot	good	awesome	nice	no	nice
5	0.13	0.05	0.07	0.03	0.1	0.12	0.16	0.08	0.09	0.1
6	fun	amazing	graphic	amazing	player	fun	graphic	bad	buy	sound
6	0.12	0.04	0.03	0.03	0.09	0.1	0.13	0.08	0.09	0.09
7	graphic	love	fantastic	release	love	money	episode	game	expansion	car
7	0.11	0.03	0.03	0.03	0.09	0.09	0.1	0.07	0.08	0.07
8	story	year	music	steam	friend	pay	bad	puzzle	single	control
8	0.11	0.01	0.03	0.03	0.08	0.06	0.09	0.07	0.07	0.07
9	no	racing	puzzle	money	story	people	character	worth	fun	year
9	0.1	0.01	0.03	0.03	0.08	0.06	0.07	0.06	0.07	0.05
10	bad	excellent	story	fun	character	player	buy	pretty	unit	bug
10	0.1	0.01	0.03	0.03	0.07	0.05	0.06	0.06	0.07	0.05
11	feel	perfect	recommend	never	awesome	free	puzzle	buy	bug	port
11	0.08	0.01	0.02	0.03	0.07	0.05	0.05	0.05	0.07	0.05
12	buy	strategy	excellent	awesome	amazing	bug	adventure	car	release	version
12	0.08	0.01	0.02	0.03	0.06	0.05	0.04	0.04	0.07	0.05
13	thing	music	atmosphere	pay	mode	friend	music	cool	campaign	terrible
13	0.08	0.01	0.02	0.03	0.05	0.05	0.04	0.04	0.07	0.05
14	love	adventure	beautiful	free	enjoy	year	atmosphere	price	lot	play
14	0.08	0.01	0.02	0.03	0.05	0.05	0.03	0.04	0.06	0.05

Table 6. Experiments using the Word2Vec CBOW embeddings of size 256. (Note: Bold marks the best accuracy obtained for the combination of Algorithm and Doc2Vec/DocTopic2Vec).

Algorithm	Doc2Vec	DocTopic2Vec (LDA)	DocTopic2Vec (NMF)	DocTopic2Vec (LSI)
LogReg	0.7259	0.6827	0.7614	0.7610
GRU	0.7429	0.5770	0.7663	0.7718
BiGRU	0.7440	0.5768	0.7681	0.7713
3GRU	0.7464	0.5769	0.7708	0.7713
3BiGRU	0.7469	0.5769	0.7700	0.7668
LSTM	0.7407	0.5771	0.7686	0.7715
BiLSTM	0.7457	0.5769	0.7678	0.7686
3LSTM	0.7459	0.5770	0.7713	0.7689
3BiLSTM	0.7491	0.5768	0.7685	0.7674
CNN	0.7230	0.3111	0.7515	0.7583
CNN-GRU	0.7400	0.5768	0.7613	0.7589
CNN-BiGRU	0.7455	0.7636	0.7643	0.7708
CNN-3GRU	0.7415	0.5768	0.7603	0.7631
CNN-3BiGRU	0.7463	0.5768	0.7596	0.7615
CNN-LSTM	0.7423	0.5769	0.7635	0.7679
CNN-BiLSTM	0.7449	0.7647	0.7653	0.7662
CNN-3LSTM	0.7401	0.5768	0.7616	0.7633
CNN-3BiLSTM	0.7411	0.7608	0.7603	0.7618
4CNN-BiLSTM [3]	0.7244	0.7463	0.7425	0.7534
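
One way to read Table 6 is to find, for each algorithm, the representation with the highest accuracy and its gain over plain Doc2Vec. The sketch below does this for a handful of rows copied from Table 6. It is illustrative only: the selection of rows and the helper name `best_representation` are assumptions, and the same function can be applied to the remaining rows and tables.

```python
# Column order used in Table 6; the last three are DocTopic2Vec variants.
COLUMNS = ("Doc2Vec", "LDA", "NMF", "LSI")

# A subset of rows copied from Table 6 (Word2Vec CBOW, size 256).
TABLE6_ROWS = {
    "LogReg": (0.7259, 0.6827, 0.7614, 0.7610),
    "GRU": (0.7429, 0.5770, 0.7663, 0.7718),
    "CNN": (0.7230, 0.3111, 0.7515, 0.7583),
    "CNN-BiGRU": (0.7455, 0.7636, 0.7643, 0.7708),
    "4CNN-BiLSTM": (0.7244, 0.7463, 0.7425, 0.7534),
}


def best_representation(accuracies):
    """Return the best-scoring column and its accuracy gain over Doc2Vec."""
    scores = dict(zip(COLUMNS, accuracies))
    best = max(scores, key=scores.get)
    gain = round(scores[best] - scores["Doc2Vec"], 4)
    return best, gain


best_by_algorithm = {
    algorithm: best_representation(accuracies)
    for algorithm, accuracies in TABLE6_ROWS.items()
}

for algorithm, (best, gain) in best_by_algorithm.items():
    print(f"{algorithm}: best = {best}, gain over Doc2Vec = {gain:+.4f}")
```

For the rows selected here, the best accuracy always comes from a DocTopic2Vec built with NMF or LSI, with gains over Doc2Vec between about 0.025 and 0.036.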

Table 7. Experiments using the Word2Vec Skip-Gram embeddings of size 128. (Note: Bold marks the best accuracy obtained for the combination of Algorithm and Doc2Vec/DocTopic2Vec).

Algorithm	Doc2Vec	DocTopic2Vec (LDA)	DocTopic2Vec (NMF)	DocTopic2Vec (LSI)
LogReg	0.7215	0.5772	0.7533	0.7542
GRU	0.7334	0.5768	0.7536	0.7532
BiGRU	0.7359	0.5767	0.7553	0.7563
3GRU	0.7378	0.5768	0.7563	0.7579
3BiGRU	0.7390	0.5767	0.7557	0.7608
LSTM	0.7322	0.5768	0.7530	0.7531
BiLSTM	0.7375	0.5767	0.7557	0.7559
3LSTM	0.7355	0.5768	0.7540	0.7563
3BiLSTM	0.7401	0.5768	0.7584	0.7593
CNN	0.7230	0.5768	0.7507	0.7560
CNN-GRU	0.7387	0.5768	0.7606	0.7543
CNN-BiGRU	0.7445	0.7574	0.7655	0.7628
CNN-3GRU	0.7392	0.5768	0.7569	0.7589
CNN-3BiGRU	0.7456	0.7573	0.7648	0.7639
CNN-LSTM	0.7405	0.5768	0.7599	0.7599
CNN-BiLSTM	0.7430	0.7586	0.7617	0.7594
CNN-3LSTM	0.7306	0.5768	0.7622	0.7586
CNN-3BiLSTM	0.7384	0.7578	0.7630	0.7640
4CNN-BiLSTM [3]	0.7332	0.7592	0.7559	0.7573

Table 8. Experiments using the FastText CBOW embeddings of size 256. (Note: Bold marks the best accuracy obtained for the combination of Algorithm and Doc2Vec/DocTopic2Vec).

Algorithm	Doc2Vec	DocTopic2Vec (LDA)	DocTopic2Vec (NMF)	DocTopic2Vec (LSI)
LogReg	0.7201	0.6699	0.7518	0.7462
GRU	0.7412	0.5772	0.7547	0.7553
BiGRU	0.7399	0.5772	0.7586	0.7573
3GRU	0.7434	0.5771	0.7591	0.7558
3BiGRU	0.7450	0.5772	0.7575	0.7521
LSTM	0.7399	0.5772	0.7550	0.7565
BiLSTM	0.7405	0.5772	0.7599	0.7553
3LSTM	0.7416	0.5768	0.7584	0.7562
3BiLSTM	0.7441	0.5772	0.7591	0.7568
CNN	0.7206	0.2279	0.7496	0.7479
CNN-GRU	0.7343	0.5768	0.7555	0.7584
CNN-BiGRU	0.7440	0.7539	0.7547	0.7620
CNN-3GRU	0.7347	0.5768	0.7523	0.7578
CNN-3BiGRU	0.7409	0.7528	0.7556	0.7571
CNN-LSTM	0.7376	0.5768	0.7521	0.7522
CNN-BiLSTM	0.7421	0.7504	0.7554	0.7528
CNN-3LSTM	0.7376	0.5768	0.7536	0.7511
CNN-3BiLSTM	0.7366	0.7536	0.7566	0.7582
4CNN-BiLSTM [3]	0.7206	0.7430	0.7431	0.7394

Table 9. Experiments using the FastText Skip-Gram embeddings of size 128. (Note: Bold marks the best accuracy obtained for the combination of Algorithm and Doc2Vec/DocTopic2Vec).

Algorithm	Doc2Vec	DocTopic2Vec (LDA)	DocTopic2Vec (NMF)	DocTopic2Vec (LSI)
LogReg	0.7227	0.5771	0.7514	0.7522
GRU	0.7329	0.5770	0.7583	0.7535
BiGRU	0.7359	0.5769	0.7625	0.7579
3GRU	0.7377	0.5768	0.7613	0.7561
3BiGRU	0.7389	0.5768	0.7639	0.7608
LSTM	0.7331	0.5768	0.7616	0.7515
BiLSTM	0.7349	0.5769	0.7613	0.7566
3LSTM	0.7355	0.5771	0.7609	0.7567
3BiLSTM	0.7400	0.5768	0.7637	0.7599
CNN	0.7240	0.2274	0.7551	0.7563
CNN-GRU	0.7441	0.5768	0.7612	0.7631
CNN-BiGRU	0.7467	0.7612	0.7653	0.7662
CNN-3GRU	0.7392	0.5768	0.7612	0.7650
CNN-3BiGRU	0.7452	0.7620	0.7644	0.7634
CNN-LSTM	0.7418	0.5768	0.7603	0.7612
CNN-BiLSTM	0.7426	0.7610	0.7617	0.7636
CNN-3LSTM	0.7295	0.5768	0.7620	0.7624
CNN-3BiLSTM	0.7428	0.7625	0.7632	0.7608
4CNN-BiLSTM [3]	0.7395	0.7597	0.7581	0.7546

Table 10. Experiments using the GloVe embeddings of size 128. (Note: Bold marks the best accuracy obtained for the combination of Algorithm and Doc2Vec/DocTopic2Vec).

Algorithm	Doc2Vec	DocTopic2Vec (LDA)	DocTopic2Vec (NMF)	DocTopic2Vec (LSI)
LogReg	0.7089	0.6587	0.7408	0.7432
GRU	0.7097	0.5769	0.7440	0.7461
BiGRU	0.7174	0.5768	0.7457	0.7457
3GRU	0.7255	0.5768	0.7450	0.7480
3BiGRU	0.7301	0.5768	0.7475	0.7525
LSTM	0.7156	0.5768	0.7415	0.7472
BiLSTM	0.7199	0.5768	0.7454	0.7501
3LSTM	0.7264	0.5767	0.7458	0.7481
3BiLSTM	0.7307	0.5767	0.7479	0.7532
CNN	0.7032	0.5745	0.7447	0.7426
CNN-GRU	0.7280	0.5768	0.7504	0.7424
CNN-BiGRU	0.7330	0.7420	0.7552	0.7541
CNN-3GRU	0.7248	0.5768	0.7450	0.7447
CNN-3BiGRU	0.7299	0.5768	0.7548	0.7560
CNN-LSTM	0.7242	0.5768	0.7496	0.7506
CNN-BiLSTM	0.7293	0.7428	0.7543	0.7520
CNN-3LSTM	0.7256	0.5768	0.7489	0.7424
CNN-3BiLSTM	0.7238	0.5768	0.7517	0.7474
4CNN-BiLSTM [3]	0.7190	0.7423	0.7509	0.7453

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
