1. Introduction
With the rapid rise of internet social media, aspect-level sentiment analysis has received extensive attention from both industry and academia. Whether in e-commerce markets such as eBay or on social platforms such as Twitter and YouTube, providers increasingly focus on identifying the sentiment polarity expressed by their users in order to improve user experience and maintain market competitiveness. Jalil et al. [1] gathered a large quantity of heterogeneous "COVID-Senti" data from the social networking platform Twitter. After a sophisticated data pre-processing procedure, they analyzed the influence of COVID-19 and the effectiveness of vaccinations by classifying the sentiments of these Twitter users. To better embrace the industrial internet of things (IIoT), or Industry 4.0, era, Khan et al. [2] focused on generalized aspect-based category detection (ACD) and proposed a novel convolutional attention-based bidirectional LSTM for detecting customer opinions and emotions. In this way, IIoT can provide better services by analyzing user feedback.
Fine-grained sentiment polarity analysis of unstructured text is an essential task in automatic public opinion detection and consumer review attitude recognition. Aspect-based sentiment analysis (ABSA) [3], which identifies people's opinions on specific topics, has accordingly been a focus of research in recent years; it is a fused extraction and classification problem in natural language processing (NLP). Many ABSA works focus on the sentiment polarity (positive, neutral, or negative) of a target word in given comments or reviews [4,5]. For example, in the restaurant service scenario, "Nice ambience, but the highly overrated fast-food restaurant.", the consumer mentions two opinion targets, namely "ambience" and "fast-food restaurant", expressing a positive attitude toward the first and a negative sentiment toward the second.
As the dominant line of research in fine-grained opinion mining, aspect-based sentiment analysis (ABSA) aims to identify the sentiment toward target entities and their aspects. Specifically, given a target entity of interest, ABSA methods extract its properties and identify the sentiment expressed about those properties [6]. From a technological point of view, the task can be divided into two sub-tasks, namely opinion target extraction and target sentiment identification [4,5], which correspond to extracting the properties of the target entity of interest and identifying the expressed sentiment, respectively.
Despite previous successes in the sentiment analysis research area, most existing methods ignore the interactive relations among these sub-tasks. As a result, they model the sub-tasks independently of each other, which hinders their practical performance: the goal of some methods [7,8,9,10] is only to detect the opinion targets mentioned in the text, while other methods [11,12,13,14,15] identify target sentiment polarity under the assumption that the target mention is given. To perform the ABSA task in a more practical setting, i.e., extracting the targets and the corresponding sentiments simultaneously, one typical way is to pipeline the two sub-tasks end-to-end. Essentially, these pipeline methods [7,16,17,18,19] still treat the sub-tasks as two separate steps and are not sufficient to yield satisfactory results for the complete ABSA task. They complete the ABSA task by predicting target boundary tags (e.g., B, I, E, S, and O) and sentiment tags (e.g., POS, NEG, and NEU). In recent years, joint approaches [17,19,20,21,22,23,24] for ABSA have been proposed that regard it as a prediction task over integrated boundary-and-type tags and train the two sub-tasks jointly. These joint methods differ from the above pipelines in that they utilize a set of specially designed tags integrating the discrete target boundary and sentiment tagging tasks, namely B-{POS, NEG, NEU}, I-{POS, NEG, NEU}, E-{POS, NEG, NEU}, and S-{POS, NEG, NEU}, denoting the beginning of, inside of, end of, and single-word opinion target with positive, negative, or neutral sentiment, together with O, denoting positions that belong to no sentiment word. An example illustrating the differences between these tagging schemes is shown in Table 1; one can intuitively see that the "Integrated" tagging solution, presented in Row 2, is considerably more complex than the "Discrete" tagging scheme. The "Integrated" scheme greatly expands the set of tag species and the search space, which tends to increase the complexity of tag prediction and decrease the performance of the overall ABSA tagger.
Through the above analysis of popular pipeline and joint methods, we deem it crucial to investigate the interactive relationship between these two sub-tasks in order to determine the target-oriented sentiment polarity more accurately. In the relational triple extraction research area, CasREL [25] and the tagging decomposition strategy of [26] first decompose joint relational triple extraction into head-entity (HE) and tail-entity-and-relation (TER) extraction sub-tasks, formulating relations as mappings from subject detection to object detection, which efficiently solves the triple-overlapping problem. Inspired by these works, we investigate and exploit the high inter-dependency of the ABSA sub-tasks and propose a novel cascade social comment sentiment analysis model, namely CasNSA, which guides the identification of sentiment polarity with the auxiliary task of target boundary prediction.
Specifically, our CasNSA framework contains three principal components: the contextual semantic representation (CSR) module, the target boundary recognizer (TBR), and the sentiment polarity identifier (SPI). First, the CSR component encodes the social comment sentences and provides embedded vector sequences. To investigate the representation power of transformers [27], we adopt both a transformer-based pre-trained language encoder (e.g., BERT [20,28]) and pre-trained word embeddings (e.g., Word2Vec [29] and GloVe [30]) in this part. The CSR component provides a sequence of hidden states ranging from the start to the end token of the input sentence. The TBR module then utilizes the hidden states generated by the CSR component as the semantics of the input sentence. At sentiment inference time, the SPI module utilizes the hidden state at the "[CLS]" position of the input token sequence and takes advantage of the target boundary information generated by the TBR module. In this manner, the novel hierarchical tagging method constrains the sentiment analysis to the specific opinion-target context and thus achieves better overall performance on the ABSA task.
Generally speaking, compared with other mainstream approaches to the ABSA task, our method is a simple yet insightful neural network architecture. CasNSA regards opinion-target extraction as a sequence labeling problem and target-oriented sentiment analysis as a multi-label identification problem. We also conduct a series of contrastive ablation experiments with different constructions of the CSR component. Our experimental results demonstrate BERT's clear modeling advantage over traditional RNNs based on pre-trained fixed embeddings (e.g., Word2Vec and GloVe), confirming the powerful semantic comprehension capability of BERT, one of the most popular transformer-based pre-trained language models.
Additionally, experiments show that the proposed approach achieves F1-scores of 68.13%, 62.34%, 56.40%, and 50.05% on SemEval 2014 [31], 2015 [32], 2016 [33], and the Twitter dataset [34], respectively. These experiments demonstrate that our proposed model outperforms many previously widely used methods [14,17,20] on real-life datasets.
The principal contributions in this work can be summarized as follows:
We study the complete aspect-based sentiment analysis task and formulate it as a sequence tagging problem for the opinion-target extraction sub-task (OTE) and a multi-label identification problem for the target-oriented sentiment analysis sub-task (TSA). Specifically, we introduce a novel ABSA approach named CasNSA, which is composed of three main sub-modules: the contextual semantic representation module, the target boundary recognizer, and the sentiment polarity identifier.
While many methods, including both machine learning and deep learning methods, attempt to model the correlations between the sub-tasks, our method addresses the ABSA task through a dedicated cascade neural network construction. To the best of our knowledge, this modeling mechanism is applied to the ABSA task for the first time. Based on our formulation of the complete ABSA task, we feed the specific opinion-target context representation provided by the OTE into the TSA procedure, which further constrains and guides the sentiment polarity analysis and thus achieves better performance on the complete end-to-end ABSA task.
Further empirical comparison confirms the effectiveness and rationality of our CasNSA model. We are the first to explicitly model the interactive relations between opinion target detection and target-specific sentiment identification in this cascade fashion. Furthermore, the ablation study shows that the transformer-based BERT is more effective than traditional pre-trained embedding methods (e.g., Word2Vec and GloVe).
This paper is structured as follows: Section 1 introduces the research background of aspect-based social comment sentiment analysis, and Section 2 describes the most relevant related works. Section 3 is the core of this paper: it contains the preliminaries of the E2E-ABSA task, the formulation of the interactive relations between the OTE and TSA sub-tasks, and the architectural details of our proposed CasNSA. Section 4 discusses all validation experiments, including a comparison and ablation study on several widely used datasets. Finally, Section 5 highlights the major conclusions of this work and potential future work.
3. The CasNSA Framework: Architecture Details
On the whole, our CasNSA framework can be subdivided into three principal sub-modules: the contextual semantic representation block (CSR), the target boundary recognizer (TBR), and the sentiment polarity identifier (SPI). We try several specific model implementations, derived from combinations of two variants of the CSR and three variants of the SPI, to pick the optimal model settings. The overall architecture of CasNSA with its alternative sub-modules is shown in Figure 1, and we describe its details below.
3.1. Task Preliminaries
In this work, given a user comment sentence $S = \{w_1, w_2, \dots, w_L\}$, where $L$ denotes the length of the sentence, end-to-end fine-grained aspect-based sentiment analysis (ABSA) is formulated as a sequence tagging problem and a multi-label classification problem. Specifically, it handles two sub-tasks: opinion-target extraction (OTE) and target-oriented sentiment analysis (TSA).
Definition 1. Opinion-target extraction (OTE) aims to predict a sequence of target-oriented position tags $T = \{t_1, t_2, \dots, t_L\}$, in which $t_i \in \{\mathrm{B}, \mathrm{I}, \mathrm{S}, \mathrm{O}\}$ denotes the beginning of, inside of, single token of, or outside of an opinion target. This sequence tagging task can be instantiated in many ways, including a conditional random field (CRF) tagger, a binary tagger, etc. In this paper, we employ a simple binary tagger, which adopts two identical binary classifiers to detect the start and end positions of each opinion target; we describe its concrete implementation in Section 3.4.

Definition 2. Target-oriented sentiment analysis (TSA) aims to conduct label classification for every specific opinion target. TSA detects the corresponding sentiment polarity of each possible target from three sentiment types: positive, negative, and neutral. Specifically, given a sentence $S$ and all opinion targets $O = \{o_1, o_2, \dots, o_m\}$, the TSA-relevant component predicts a tag $y_i \in \{\mathrm{POS}, \mathrm{NEG}, \mathrm{NEU}\}$ for each $o_i$, denoting the positive, negative, and neutral sentiments, respectively. For this sub-task, we employ three different feature recognition procedures to determine the optimal model setting; the concrete details are given in Section 3.5.
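For illustration only, a minimal sketch of what the two sub-tasks produce on the example sentence from Section 1 (the field names and token indices are ours, assuming simple whitespace tokenization, and are not part of the model):

```python
sentence = "Nice ambience, but the highly overrated fast-food restaurant."

# OTE output: boundary spans of each opinion target (token indices illustrative).
ote_output = [
    {"target": "ambience", "start": 1, "end": 1},
    {"target": "fast-food restaurant", "start": 7, "end": 8},
]

# TSA output: one sentiment tag per extracted target, from {POS, NEG, NEU}.
tsa_output = [
    {"target": "ambience", "polarity": "POS"},
    {"target": "fast-food restaurant", "polarity": "NEG"},
]
```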
3.2. Explicit Task Modelling

Here, let us explore the interactive relational mappings between the OTE and TSA sub-tasks. The goal of E2E-ABSA is to identify the set of all possible opinion targets $O = \{o_1, \dots, o_m\}$ and their corresponding sentiment polarities $Y = \{y_1, \dots, y_m\}$. We design the overall E2E-ABSA training objective directly toward this goal. In contrast with previous approaches such as that of Li et al., 2019 [20], where the optimization objective is defined directly over the integrated tagging scheme (including 'B-POS', 'I-NEG', and 'E-NEU'), our optimization function considers the interrelation hidden inside the integrated tags. We can then model E2E-ABSA as sequence tagging and multi-label classification separately.
Formally, given a user review $x_i$ from the training dataset $D$ and the set $T_i$ of potentially opinion-oriented targets and corresponding sentiments in $x_i$, we aim to maximize the likelihood of the whole training dataset $D$:

$$\max_{\theta} \prod_{i=1}^{|D|} p\big(T_i \mid x_i; \theta\big) \tag{1}$$

$$p\big(T_i \mid x_i\big) = \prod_{(o,\, y_o) \in T_i} p\big(o \mid x_i\big)\; p\big(y_o \mid x_i, o\big) \tag{2}$$

$$= \prod_{(o,\, y_o) \in T_i} \Big[p\big(s_o \mid x_i\big)\, p\big(e_o \mid x_i\big)\Big]\; p\big(y_o \mid x_i, o\big) \tag{3}$$

where Equation (1) denotes the overall optimization goal, and $T_i$ denotes all targets and their sentiments in the $i$-th sentence $x_i$. This formula expresses the probabilistic optimization for predicting the sentimental target entities in the whole corpus and their corresponding sentiment categories. Equation (2) models the inter-scheme tag dependencies between target detection and sentiment polarity identification: we decompose the joint probabilistic optimization into two step-by-step cascade optimizations, first extracting all sentiment target entities and then performing the corresponding sentiment discrimination based on the context semantics of the extracted entities. Equation (3) further decomposes the optimization function of the complete E2E-ABSA task, where $s_o$ and $e_o$ denote the start and end boundaries of target $o$: we cast opinion-target entity extraction as binary label classification over the boundary positions, and target-oriented sentiment analysis as standard multi-label classification. We utilize the standard cross-entropy loss function [48] to optimize it.
This formulation provides several benefits. First, in Equation (2), we take into consideration the mutual dependencies between the two sub-tasks, which almost all related work neglects. The sub-tasks mutually influence each other, so errors in each component can be constrained by the other, which helps model fine-grained aspect-based sentiment analysis better. Second, the task decomposition revealed in Equation (3) decreases the tagging complexity, since it replaces the integrated tagging schemes (such as 'B-POS' and 'I-NEG') adopted by most unified approaches. Finally, this formulation describes a deep hierarchical neural network and can thus be instantiated in many implementations.
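To make the cascade decomposition concrete, applying Equation (2) to the restaurant example from Section 1 factorizes the joint probability into one extraction factor and one sentiment factor per target:

$$p(T \mid x) = \underbrace{p(\text{"ambience"} \mid x)\; p(\mathrm{POS} \mid x, \text{"ambience"})}_{\text{first target}} \cdot \underbrace{p(\text{"fast-food restaurant"} \mid x)\; p(\mathrm{NEG} \mid x, \text{"fast-food restaurant"})}_{\text{second target}}$$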
3.3. Contextual Semantic Representation Block
In the beginning, given a user comment $S = \{w_1, w_2, \dots, w_L\}$ for sentiment analysis, we need to interpret the natural language sequence $S$ as a sequence of semantic feature vectors $H = \{h_1, h_2, \dots, h_L\}$. As shown in the bottom right of Figure 1, following most traditional methods [17,19,27,49], we employ pre-trained word embeddings (e.g., GloVe and Word2Vec) to transfer the sentence tokens into a sequence of vectors. To investigate the language modeling power of the Transformer, as shown in the bottom left of Figure 1, besides using pre-trained GloVe, we also introduce an alternative scheme that utilizes the multi-layer bidirectional Transformer-based language model BERT [28] as the sentence encoder.
Scheme 1: As illustrated in the bottom right of Figure 1, we first use 300-dimensional GloVe [30] vectors to initialize the word embeddings. At training and inference time, we add an extra marker "[START]" before the start index of the input to generate the sentence-level feature vector. The embedding operation then looks up each word's corresponding embedding and conducts the transfer $e_i = \mathrm{Embed}(w_i)$. To prevent the vanishing-gradient problem [50] of RNNs, we choose a two-layer Bi-LSTM with hidden size 200 as the basic encoder; existing works [12] have demonstrated that it has a better learning capability than the original LSTM. Compared with the vanilla LSTM, the bidirectional LSTM is mechanically identical but additionally allows a reversed information flow in which the inputs are fed from the end index to the beginning index. The encoder layer finally provides a forward hidden state $\overrightarrow{h_j}$ and a backward hidden state $\overleftarrow{h_j}$. The relevant LSTM feature propagation formulations are as follows:

$$i_t = \sigma\big(W_i x_t + U_i h_{t-1} + b_i\big)$$
$$f_t = \sigma\big(W_f x_t + U_f h_{t-1} + b_f\big)$$
$$o_t = \sigma\big(W_o x_t + U_o h_{t-1} + b_o\big)$$
$$\tilde{c}_t = \tanh\big(W_c x_t + U_c h_{t-1} + b_c\big)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$$
$$h_t = o_t \circ \tanh(c_t)$$

The variables $i_t$, $f_t$, and $o_t$ in the above equations are the input, forget, and output gates' activation vectors, respectively, each computed from the current input and the previous hidden state. The new candidate memory $\tilde{c}_t$ is obtained from the input token feature $x_t$ through the updating matrices $W_c$ and $U_c$, while the retained old memory corresponds to the previous cell state $c_{t-1}$ scaled by the forget gate $f_t$. The gates convert logits into values between 0 and 1 through the activation function $\sigma$. Here, $\circ$ denotes element-wise multiplication, $c_t$ is the cell state vector, and $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent functions.
After the information flows through the LSTM, we concatenate the forward and backward states and obtain the combined features $H = \{h_1, h_2, \dots, h_L\}$, where the $j$-th hidden state is $h_j = [\overrightarrow{h_j}; \overleftarrow{h_j}]$; the obtained hidden state sequence $H$ is then used by the two downstream modules, the TBR and the SPI.
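A minimal PyTorch sketch of this encoding scheme follows (class and argument names are ours; the paper does not publish code, so treat this as an assumption-laden illustration):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """CSR Scheme 1 sketch: 300-dim GloVe embeddings + 2-layer Bi-LSTM (hidden 200)."""
    def __init__(self, vocab_size: int, glove_weights: torch.Tensor = None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        if glove_weights is not None:          # initialize from pre-trained GloVe
            self.embed.weight.data.copy_(glove_weights)
        self.bilstm = nn.LSTM(input_size=300, hidden_size=200, num_layers=2,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, L), with a "[START]" marker token prepended.
        embedded = self.embed(token_ids)       # (batch, L, 300)
        hidden, _ = self.bilstm(embedded)      # h_j = [h_fwd_j ; h_bwd_j]
        return hidden                          # (batch, L, 400)
```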
Scheme 2: To overcome the disadvantage that a traditional fixed embedding layer (e.g., Word2Vec or GloVe) provides only a single context-independent representation, as illustrated in the bottom left of Figure 1, our CSR module alternatively adopts the pre-trained Transformer BERT [28] in our experiments. Here, we briefly introduce BERT. Bidirectional Encoder Representations from Transformers (BERT) is a self-supervised pre-training technique that learns to predict intentionally hidden (masked) sections of text. Crucially, the representations learned by BERT have been shown to generalize well to downstream tasks; when BERT was first released in 2018, it achieved state-of-the-art results on many NLP benchmarks. Specifically, BERT is composed of a stack of $N$ identical Transformer blocks, which we denote as $\mathrm{Trans}(\cdot)$, with $N$ representing BERT's depth.

Firstly, we pack the sequence of vector inputs as $H^0 = \{e_1, e_2, \dots, e_L\}$, where $e_i$ is the initialized BERT embedding vector of the $i$-th token of the sentence. Then the $N$ Transformer blocks refine the token-level semantic representations layer by layer. Taking the $j$-th Transformer block as an example, the BERT hidden features $H^j$ are calculated through Equation (9):

$$H^j = \mathrm{Trans}\big(H^{j-1}\big), \quad j \in [1, N] \tag{9}$$
where $H^j$ denotes the $j$-th layer's BERT feature representations and $H^{j-1}$ denotes those of the $(j-1)$-th layer. Finally, we regard $H^N$ as the contextualized representations of the input sentence, and CasNSA's other key components (the TBR and SPI) use them for the further downstream model-reasoning steps.
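Under Scheme 2, the encoding step reduces to a standard BERT forward pass; a sketch using the HuggingFace transformers library (the bert-base-uncased checkpoint is our assumption, as the paper does not name one):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Nice ambience, but the highly overrated fast-food restaurant."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

h_n = outputs.last_hidden_state  # H^N: (1, seq_len, 768), consumed by TBR and SPI
cls = h_n[:, 0]                  # "[CLS]" vector used as sentence-level context
```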
3.4. Target Boundary Recognizer
In place of a CRF decoding layer, we employ two binary classifiers (for the start and end positions) with a softmax decoding layer on top of BERT as the opinion-target boundary tagging step. The two classifiers jointly mark the start and end positions of each opinion target as "1" and mark tags irrelevant to target boundaries as "0". In particular, if the current boundary position is the beginning of an opinion target, the "[START]" tagger, which detects the start position of the target, outputs "1" while the "[END]" tagger outputs "0"; if the current boundary position is the end of an opinion target, the "[END]" tagger outputs "1" while the "[START]" tagger outputs "0". A single-word opinion target's position is tagged with "1" by both classifiers. For multi-target detection, we adopt the proximity principle, which regards the phrase between a "1"-tagged position of the "[START]" classifier and the nearest following "1"-tagged position of the "[END]" classifier as one opinion target. We calculate the probability of each token being an opinion-target boundary with the following formulations:
$$p_i^{start} = \sigma\big(W_{start}\, h_i + b_{start}\big)$$
$$p_i^{end} = \sigma\big(W_{end}\, h_i + b_{end}\big)$$

where $h_i$ represents the $i$-th output vector of the contextual semantic representation module, $W_{start}$, $W_{end}$ and $b_{start}$, $b_{end}$ are the learnable weight matrices and bias values of the "[START]" and "[END]" classifiers, $\sigma$ denotes the activation function, and $p_i^{start}$ and $p_i^{end}$ denote the boundary probabilities of the $i$-th token in the input sentence. A position is considered a boundary and marked as "1" when $p_i^{start}$ or $p_i^{end}$ exceeds a certain threshold (0.5 in this paper); otherwise, it is regarded as a non-boundary token and marked as "0".
As shown in Table 2, this example concisely illustrates how our binary tagging strategy differs from traditional decoding layers (e.g., CRF). In particular, the token "designs" is both the first and the last word of the opinion target "designs", so its tags are "1" in both the "[START]" and "[END]" classifiers when recognizing such single-word target boundaries.
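The proximity-principle decoding itself can be sketched in a few lines of Python (function and variable names are ours):

```python
from typing import List, Tuple

def decode_spans(p_start: List[float], p_end: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Pair each start position whose probability exceeds the threshold with
    the nearest end position at or after it (single-word targets yield j == k)."""
    starts = [i for i, p in enumerate(p_start) if p > threshold]
    ends = [i for i, p in enumerate(p_end) if p > threshold]
    spans = []
    for j in starts:
        candidates = [k for k in ends if k >= j]
        if candidates:
            spans.append((j, min(candidates)))  # proximity principle
    return spans

# For the single-word target "designs" at position i, both p_start[i] and
# p_end[i] exceed 0.5, so decode_spans returns the span (i, i).
```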
After the target boundary recognizer generates the predicted tag sequences $\{p_i^{start}\}_{i=1}^{L}$ and $\{p_i^{end}\}_{i=1}^{L}$, we compute the objective with the binary cross-entropy loss function [48]:

$$\mathcal{L}_{TBR} = -\sum_{s \in D}\sum_{i=1}^{L}\Big[y_i^{start}\log p_i^{start} + \big(1 - y_i^{start}\big)\log\big(1 - p_i^{start}\big) + y_i^{end}\log p_i^{end} + \big(1 - y_i^{end}\big)\log\big(1 - p_i^{end}\big)\Big]$$

where $D$ and $L$ denote the training set and the length of one sentence in $D$, $y_i^{start}$ and $p_i^{start}$ are the $i$-th token's gold tag and the value predicted by the "[START]" binary classifier, and $y_i^{end}$ and $p_i^{end}$ are the $i$-th token's gold tag and the value predicted by the "[END]" binary classifier.
3.5. Sentiment Polarity Identifier
Similar to the target boundary recognizer, the sentiment polarity identifier also takes the contextual semantic representation output features as its inputs. However, the key features required for target-oriented sentiment identification include (1) the opinion target itself; (2) the context that indicates the sentence-level sentiments; and (3) the mutual relationships between the opinion-target features and the contexts. Under these considerations, we propose the target and context joint-aware representation $u_i$: given the extracted $i$-th opinion target whose start and end indices in the sentence are $j$ and $k$, we define $u_i$ as follows.

Formally, we take $u_i$ as the combination of the target span feature $h_{j:k}$ and the context feature $c$, in which $h_j$ and $h_k$ denote the feature representations at the start and end positions of the $i$-th opinion target. We regard the output vector located at the "[CLS]" position in BERT, or at the "[START]" position in the Bi-LSTM, as the sentence-level context semantics $c$. For a multi-word opinion target where $k > j$, we employ a mean-pooling operation to incorporate the word features into $h_{j:k} = \mathrm{meanpool}(h_j, \dots, h_k)$.
Then the assembled features $u_i$, which contain all relevant signals about the $i$-th target for sentiment polarity identification, are sent to the sentiment polarity identifier (denoted $f_{SPI}(\cdot)$), and the SPI module predicts the corresponding sentiment polarity label $\hat{y}_i \in \{\mathrm{POS}, \mathrm{NEG}, \mathrm{NEU}\}$.

The SPI component's feature decoding function $f_{SPI}(\cdot)$ can be instantiated in many ways. In our experiments, we explore several methods of conducting the feature integration procedure, including (1) simple bitwise adding, $u_i = h_{j:k} + c$; (2) simple vector concatenation, $u_i = [h_{j:k}\,; c]$; and (3) CNN-based feature extraction.
These feature incorporation steps are illustrated in the top part of Figure 1. For the bitwise adding and vector concatenation variants, an extra fully connected layer is employed after the fusion. The CNN-based method generates the intermediate feature by running a CNN over the token feature sequence of the target span and the context, with the window size of the CNN's convolutional kernel set to 3. In the last layer of the whole network, we add a linear layer whose output dimension is set to 3, so that the target-oriented sentiment polarity can be classified directly. For training, we perform multi-label classification by adopting cross-entropy [48] as the loss function, and the loss of the sentiment polarity identification sub-task is calculated as follows:

$$\mathcal{L}_{SPI} = -\sum_{s \in D}\sum_{i=1}^{m}\sum_{c \in \{\mathrm{POS},\, \mathrm{NEG},\, \mathrm{NEU}\}} y_{i,c}\log p_{i,c}$$

where $D$ and $s$ denote all training samples and one training instance of $D$, $m$ is the number of opinion targets in $s$, $o_i$ is the $i$-th opinion target revealed in $s$, $y_{i,c}$ is the gold label of the target-oriented sentiment polarity, and $p_{i,c}$ is the sentiment prediction result.
It is worth noting that $o_i$ is the gold opinion target at training time. In contrast, at inference time, we select the predicted opinion targets one by one from the TBR module to complete the joint extraction task.
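A sketch of the SPI with the vector-concatenation variant (class names and the ReLU non-linearity are our assumptions; the adding and CNN variants would replace the fusion line):

```python
import torch
import torch.nn as nn

class SentimentPolarityIdentifier(nn.Module):
    """SPI sketch: fuse the target span features with the sentence context
    and classify into {POS, NEG, NEU}."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim * 2, hidden_dim)  # after concatenation
        self.classifier = nn.Linear(hidden_dim, 3)       # POS / NEG / NEU

    def forward(self, h: torch.Tensor, c: torch.Tensor, j: int, k: int):
        # h: (batch, L, hidden_dim); c: (batch, hidden_dim) "[CLS]" context.
        h_jk = h[:, j:k + 1].mean(dim=1)       # mean-pool multi-word targets
        u = torch.cat([h_jk, c], dim=-1)       # concatenation fusion variant
        return self.classifier(torch.relu(self.fc(u)))   # sentiment logits
```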
3.6. Training Objective of Target and Sentiment Joint Extractor
Following the previous unified works [25,26,51], we combine the loss functions of the two sub-tasks, and the overall objective $\mathcal{L}$ is defined as follows:

$$\mathcal{L} = \mathcal{L}_{TBR} + \lambda\, \mathcal{L}_{SPI}$$

where $\lambda$ is a coefficient that moderates the mutual contribution weight of the two sub-tasks; we set it to 1 in our experiments. We train and optimize the model parameters by minimizing $\mathcal{L}$ with the Adam stochastic gradient descent optimizer [52] over shuffled mini-batches, each containing 16 training samples.
5. Conclusions and Perspectives
In the context of sentiment polarity identification for internet social user comments, we investigate the importance and effectiveness of the interactive relations between the opinion target extraction sub-task and the target sentiment identification sub-task, which most researchers in the domain have neglected. To some extent, many researchers have been discouraged from further exploration because these two sub-tasks are so highly coupled. We propose a novel collaborative learning framework named cascade social sentiment analysis (CasNSA) to tackle this critical challenge. Our CasNSA model takes advantage of the opinion-target contextualized semantic features provided by the opinion target extraction sub-task to guide sentiment polarity identification, predicting the aspect-oriented sentiment polarities more accurately. Extensive empirical results show that the proposed approach achieves a 68.13% F1-score on SemEval-2014, 62.34% on SemEval-2015, 56.40% on SemEval-2016, and 50.05% on the Twitter dataset, which is higher than existing approaches.
In particular, the most important novelty of this paper can be summarized in two main respects: (1) to solve the significant problems caused by modeling target sentiment and opinion target extraction discretely, our target sentiment identification module utilizes the opinion-target contextualized semantic features generated by the target extraction module when predicting the sentiment polarity, thus ensuring the quality of sentiment polarity identification; (2) we investigate the superiority of BERT's encoding capability by introducing the BERT encoder as the generator of contextualized feature representations of social users' comments. The model's outstanding performance with BERT fine-tuning firmly demonstrates that the multi-head self-attention-based Transformer remains predominant in capturing aspect-based sentiment and is robust to overfitting on insufficient samples. Meanwhile, unlike other pipeline methods, the novel unified framework CasNSA is designed to handle the aspect-based social comment sentiment analysis task in an end-to-end fashion. As a result, CasNSA jointly predicts the target boundary positions together with the target-oriented sentiment polarities, thereby effectively tackling the error accumulation issue present in most pipeline methods.
Moreover, the empirical comparison results illustrate the superiority of our proposed model and the effectiveness of its sub-components, such as the hierarchical cascade sequence tagging unit and the BERT encoder. We believe E2E-ABSA will continue to be an attractive and promising research direction for realistic industrial and domestic scenarios, such as intelligent recommendation, smart personal assistants, big data mining services, and automatic customer service.
In the future, we plan to study the following major problems. (i) To support real-world dynamic application scenarios, social comment sentiment analysis applications are updated quickly and inevitably need to cover new scenarios in real time. How can we extend our framework's coverage to handle different scenarios automatically and incrementally? (ii) The framework is currently built on relatively small datasets under weak supervision, without prior external knowledge. How can we introduce external knowledge, such as web textual target-entity descriptions and other open-domain knowledge, to improve CasNSA's performance?