1. Introduction
In the age of social media, industries and firms' advisors increasingly depend on user-generated opinions to forecast future earnings. These opinionated, unstructured data appear as reviews, blog discussions, graphics, audio, video, and other media that lack any fixed structure. This makes the field challenging, owing to the ambiguities of natural language, the exponential growth of social media content, and the indirect sentiments expressed in user-generated text [1]. In this situation, data analysts widely adopt aspect-based sentiment analysis (ABSA) to understand users' or consumers' requirements, filter out irrelevant data, and obtain relevant suggestions that make organizational and industrial decisions sound. Generally, two types of online attitudes, reviews, or opinions are observed: product reviews and experience sharing regarding these products or services. The first type discusses the features of a particular entity, such as a product or service, whereas the second type compares the features of several entities to identify their pros and cons [2].
The extraction of accurate features of a targeted entity has become a critical issue in NLP due to the complex nature of contextual information. The contextual information surrounding the features of a targeted entity is highly important in these circumstances because it provides valuable clues for their accurate identification and extraction [3,4]. Nevertheless, the precise identification and extraction of these features still demand attention from the research community. Traditionally, feature extraction and identification have been accomplished through various methodologies, such as machine learning [5,6,7]; topic modeling [8,9,10]; and lexicon-based [11,12,13], rule-based, and syntactic relation-based [14,15,16,17,18] methods. Syntactic pattern techniques perform well at extracting features and classifying their sentiments. However, they are discouraged by their time consumption and by the specialist effort needed to create rules and lexicons, which restricts them to a specific domain and language [19,20]. Additionally, supervised machine learning methodologies rely heavily on large volumes of labeled data, which is a bottleneck of these methods [21]. Semi-supervised methodologies demand less labeled data for training, but the complexity of their feature selection hinders such approaches, while unsupervised methods rely heavily on manual feature engineering. The quality of the extracted features then depends on that manual process, which limits the scalability and adaptability of these approaches across application domains [22,23].
One of the most highly recommended approaches in the machine learning field is deep learning (DL). It addresses diverse NLP challenges such as machine translation, named entity recognition, and sentiment analysis (SA) [24]. Recent advances in NLP rely heavily on DL architectures to extract such valuable features and classify their sentiments. Recurrent and convolutional neural networks are the two leading DL architectures. CNNs owe their position to convolution kernels, which make them distinctive at extracting targeted features, while recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) and the gated recurrent unit (GRU), have few rivals at capturing contextual information, which makes them versatile under varying circumstances [25].
An RNN analyzes a whole sentence, word by word, capturing its semantic information in hidden states. It also captures long-range semantic dependencies in long contexts, but in a biased manner: each later word dominates the words that precede it. However, worthwhile terms can occur at any position in a sentence, which noticeably reduces the model's effectiveness. Generally, RNN-based models learn sequential patterns through temporal features and long-term semantic dependencies between pairs of words. In addition, these methods attend equally to every word of the targeted sentence and therefore do not distinguish ordinary words from prominent ones, which are more dominant and influential in the contextual knowledge. This degrades the performance of RNN-based approaches [25,26,27].
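To make the word-by-word recurrence concrete, the following minimal PyTorch sketch (an illustration with hypothetical names and dimensions, not any model from the paper) encodes a tokenized sentence with a GRU, exposing one hidden state per word; the final state is naturally biased toward the most recent words, as discussed above.

```python
import torch
import torch.nn as nn

# Minimal GRU encoder: reads the sentence left to right and produces
# one hidden state per word. Sizes below are illustrative choices.
class GRUEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):          # (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        states, last = self.gru(x)         # states: one vector per word
        return states, last                # `last` leans toward recent words

encoder = GRUEncoder()
tokens = torch.randint(0, 10000, (1, 12))  # a 12-word "sentence"
states, last = encoder(tokens)
print(states.shape)                        # torch.Size([1, 12, 128])
```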
CNNs, on the other side, present themselves as unbiased models. They comprise convolution kernels and max-pooling layers that extract the prominent features of a targeted entity within a sentence, and as a result they capture sentence semantics more effectively than RNNs. However, CNNs face an issue in determining the optimal kernel size: a small kernel may lose critical information, whereas a larger one may take in irrelevant terms and train the model erroneously. Their filters effectively capture local features, which proves beneficial when extracting semantically influential terms. Moreover, such models never demand order-sensitive long-term semantic dependencies of the sentence during training; they learn from local feature information alone [25,26,28].
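A common way to soften the kernel-size trade-off just described is to run several kernel widths in parallel and max-pool each one; the sketch below (an illustration under assumed dimensions, not the paper's model) does this for widths 2, 3, and 4.

```python
import torch
import torch.nn as nn

# Parallel kernels of widths 2, 3, and 4 over a word-embedding sequence;
# max-pooling keeps the strongest local feature found by each kernel width.
class MultiKernelCNN(nn.Module):
    def __init__(self, embed_dim=100, n_filters=64, widths=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=w) for w in widths
        )

    def forward(self, x):                    # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                # Conv1d expects channels first
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)      # (batch, n_filters * len(widths))

cnn = MultiKernelCNN()
sentence = torch.randn(1, 12, 100)           # 12 words, 100-dim embeddings
print(cnn(sentence).shape)                   # torch.Size([1, 192])
```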
Recently, researchers have advanced this line of work by involving the attention mechanism, which improves sentiment classification through a credible exposition of opinion targets [29]. Thus, we can conclude that neither model (CNN or RNN) alone can deliver state-of-the-art performance at extracting feature terms and classifying their sentiments, whereas their combination can improve sentiment classification through the accurate extraction of features. Consequently, the existing approaches to feature-term identification and extraction found in the literature have adopted this methodology, combining the models either serially or in parallel. In the serial method, one model obtains the actual text while the other acquires the first model's output, which causes information loss. In the parallel technique, the models are never treated equally in terms of input: one model receives the whole set of inputs while the other obtains only a subset. This underutilizes one model relative to the other and undermines the benefits of combining the algorithms in parallel. This scenario calls on the research community to develop techniques in which both models interact directly with the textual context through an equal number of input parameters and then combine their learnings to exploit both models fully.
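To illustrate the parallel arrangement argued for here (a minimal sketch assuming both branches receive the identical embedded sentence; it is not the Att-JM implementation), a GRU branch and a CNN branch can read the same input and have their representations concatenated, so neither branch depends on the other's output.

```python
import torch
import torch.nn as nn

# Parallel fusion: the CNN and GRU branches read the *same* embedded
# sentence, avoiding the information loss of a serial pipeline.
class ParallelFusion(nn.Module):
    def __init__(self, embed_dim=100, hidden_dim=128, n_filters=64):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3)

    def forward(self, x):                        # (batch, seq_len, embed_dim)
        gru_states, _ = self.gru(x)              # sequential/global view
        gru_feat = gru_states.mean(dim=1)        # (batch, 2 * hidden_dim)
        cnn_feat = (self.conv(x.transpose(1, 2)) # local n-gram view
                    .relu().max(dim=2).values)   # (batch, n_filters)
        return torch.cat([gru_feat, cnn_feat], dim=1)

model = ParallelFusion()
sentence = torch.randn(1, 12, 100)
print(model(sentence).shape)                     # torch.Size([1, 320])
```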
Both of the above-mentioned algorithms have distinctive pros and cons. Motivated by this, the attention-based joint model (Att-JM) utilizes them jointly and applies attention mechanisms to both in order to accurately identify informative features and classify the sentiments they express. While integrating the algorithms in parallel, Att-JM shares hidden-layer information between them to gain the benefit of their combined learning, and it distributes an equal number of inputs to each to accomplish the main tasks of ABSA. The main contributions of this paper are as follows:
The proposed approach performs the parallel fusion of a multichannel convolutional neural network (MC-CNN) and a multichannel gated recurrent unit (MC-GRU) with various deep features while accomplishing the main tasks of ABSA.
The approach explores the collective use of word2vec embeddings and contextual position information, and the effect of their uniform distribution, on the performance of the merged deep learning algorithm during aspect identification and sentiment classification (see the sketch after this list).
The proposed approach shares hidden-layer information between the merged models so that they attain the advantages of their combined abilities and learnings while predicting aspects and classifying their sentiments.
The proposed approach outperforms its counterparts when assessed with standard evaluation metrics, i.e., precision, recall, and the F1 measure, on the standardized SemEval and Twitter datasets, achieving an F1 measure of 95% on the aspect term extraction (ATE) task and 92% on the sentiment classification (SC) task.
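As an illustration of the shared-input idea in the second contribution (a hedged sketch with hypothetical names and dimensions, not the Att-JM implementation), one can concatenate word2vec vectors with a learned embedding of each word's position and hand the identical tensor to both channels.

```python
import torch
import torch.nn as nn

# Shared input construction: word2vec vectors concatenated with a learned
# position embedding, then handed unchanged to *both* parallel channels.
class SharedInput(nn.Module):
    def __init__(self, w2v_weights, max_len=80, pos_dim=20):
        super().__init__()
        self.word = nn.Embedding.from_pretrained(w2v_weights, freeze=False)
        self.pos = nn.Embedding(max_len, pos_dim)

    def forward(self, token_ids):                # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        w = self.word(token_ids)                 # (batch, seq_len, 100)
        p = self.pos(positions).expand(token_ids.size(0), -1, -1)
        return torch.cat([w, p], dim=-1)         # same tensor for the MC-CNN
                                                 # and MC-GRU channels

w2v = torch.randn(10000, 100)                    # stand-in word2vec table
shared = SharedInput(w2v)
x = shared(torch.randint(0, 10000, (1, 12)))
print(x.shape)                                   # torch.Size([1, 12, 120])
```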
The rest of the paper is organized as follows: Section 2 presents existing work on aspect extraction and sentiment classification as "Related Work". Section 3 expresses the overall methodology and details of the proposed approach as the "Proposed Research Methodology". Section 4 describes the experimental environment considered for the development of this model as "Experimental Arrangements". Section 5 presents the performance comparison of the model as "Results and Discussion". Finally, Section 6 concludes the whole approach and outlines future work as "Conclusion".
2. Related Work
The extraction of sentiment from a given piece of text is known as SA. Its leading aim is to estimate the feelings, thoughts, and attitudes of users towards groups, individuals, products, or brands. It can automatically recognize and extract aspects and their corresponding opinions from online textual reviews and then classify the polarities of those opinions [30]. Document-level and sentence-level SA, however, cannot explain users' likes or dislikes concerning a specific feature of an entity; they focus only on the sentiment of the entire document or sentence, which is not beneficial in every daily-life scenario. Users are sometimes interested in specific aspects of products or services (which lie outside the scope of document- and sentence-level SA), and handling this scenario demands ABSA. ABSA is a distinct genre of text mining and a fine-grained form of SA that can extract aspects and their corresponding sentiment polarities from a sentence. Moreover, it can summarize the aspect-related sentiments of user-generated reviews, which is a general task of ABSA [31,32].
Additionally, diverse challenges still degrade the performance of ABSA, such as identifying the textual parts of a context that depict identical aspects, determining the relationship between features and the text, and handling comparative sentences [33]. The task of ABSA is accomplished in two phases. The first phase specifies all conceivable aspects (implicit or explicit) related to a specific topic or product, so that all potential features are known and available for polarity assignment. The second phase then determines the sentiment polarities and assigns them to their interrelated aspects. Explicit aspect extraction/identification is considered the main subtask of ABSA: it extracts those aspects or features of a specific entity that are explicitly mentioned in a review and on which users express their opinions and comments [34].
In the early days of ABSA, traditional approaches to aspect identification/extraction relied heavily on machine learning methods (e.g., Nearest Neighbor, Support Vector Machine, Naïve Bayes, Decision Tree), noun-frequency-based methods, lexicons (e.g., WordNet, SenticNet), topic modeling (e.g., LDA), n-gram combinations, and rule-based approaches. These feature engineering-based procedures depend on manual annotation, rule creation, and handcrafted features, which are laborious, time-consuming, and domain-dependent, and cause performance bottlenecks [35]. The remarkable achievements of DL methodologies in NLP inspired researchers to apply them to the main tasks of SA. At present, DL methods dominate ABSA tasks, although the inclusion of human-like reasoning remains an open research area for future contributions [36,37]. The success of DL methodologies in NLP made them commendable and created space for applying them to aspect extraction and the classification of the corresponding sentiments [38].
According to the literature, aspects were long extracted using handcrafted features through laborious, complicated, and time-consuming methods that demand much effort from analysts. This motivated Xu et al. [39] to present a supervised approach that identifies potential features using a DL method: a CNN with two embedding layers, one trained on general text and the other trained on a specific domain for aspect extraction. Additionally, Shu et al. [40] proposed a modified form of CNN named controlled CNN (Ctrl), which consists of two control modules, one controlling the embeddings and one controlling the CNN; asynchronously updating the CNN's parameters prevents over-fitting while significantly boosting performance. Furthermore, A. Da'u and N. Salim [41] presented a multichannel CNN model that uses word embeddings and PoS tags as textual features along two input channels: the first channel takes the word embeddings as input, while the other takes sequential PoS-tag information as input for aspect identification.
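A minimal sketch of such a two-channel arrangement (hypothetical dimensions; it follows the general idea in [41], not the authors' code) runs one convolution over word embeddings and another over PoS-tag embeddings, then concatenates the pooled features.

```python
import torch
import torch.nn as nn

# Two input channels: one convolution over word embeddings, another over
# PoS-tag embeddings; pooled features from both channels are concatenated.
class TwoChannelCNN(nn.Module):
    def __init__(self, vocab=10000, n_tags=45, embed_dim=100, tag_dim=25,
                 n_filters=64):
        super().__init__()
        self.word = nn.Embedding(vocab, embed_dim)
        self.tag = nn.Embedding(n_tags, tag_dim)
        self.word_conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3)
        self.tag_conv = nn.Conv1d(tag_dim, n_filters, kernel_size=3)

    def forward(self, word_ids, tag_ids):        # both (batch, seq_len)
        w = self.word(word_ids).transpose(1, 2)
        t = self.tag(tag_ids).transpose(1, 2)
        wf = self.word_conv(w).relu().max(dim=2).values
        tf = self.tag_conv(t).relu().max(dim=2).values
        return torch.cat([wf, tf], dim=1)        # (batch, 2 * n_filters)

net = TwoChannelCNN()
words = torch.randint(0, 10000, (1, 12))
tags = torch.randint(0, 45, (1, 12))
print(net(words, tags).shape)                    # torch.Size([1, 128])
```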
RNN-based approaches achieve state-of-the-art performance because their long-term dependencies and temporal features enhance the learning of textual representations and sequential information. Accordingly, Li et al. [42] proposed a framework comprising two LSTMs that performs ATE based on a summary of previously identified opinions and aspects: at each step, a history-based attention truncates unneeded terms from the recently predicted representations and identifies valuable features. Saraiva et al. [43], on the other hand, proposed POS-AttWD-BLSTM-CRF, which uses the attention mechanism as an encoder that determines the grammatical dependencies among the targeted words. Instead of electing a subsection of the PoS-tagged features, their approach selects the most relevant among them; these features are collectively provided to a Bi-LSTM-CRF classifier that accomplishes the ABSA task.
Exploration of the relevant literature shows that the combination of Bi-GRU and a conditional random field (CRF) is the most widely used method for the most challenging ABSA tasks, i.e., aspect term identification/extraction and sentiment classification. These models are trained on the labeled SemEval 2014 dataset using either pre-trained GloVe or word2vec embeddings [44]. The literature also shows that supervised methods outperform rule-based ones, but this high performance is paid for with a large volume of annotated samples for training, which is time-consuming and expensive. This situation motivated Wu et al. [45] to propose a hybrid unsupervised approach that identifies aspects in the targeted context by combining linguistic rules with a GRU to classify targeted terms as aspect or non-aspect. Moreover, accurate aspect term extraction and identification depend on the long-term dependencies of the sentence or on noun phrases, which limits usability (e.g., in cross-domain scenarios) and accuracy. With these motivations, Chauhan et al. [46] proposed a hybrid two-step unsupervised model that integrates linguistic patterns with an attention-based Bi-LSTM to perform ATE. The first step applies linguistic rules to extract potential single- or multi-word aspects; domain correlation then filters the terms related to a specific domain, and the filtered terms are transformed into a fine-tuned word embedding. The aspects determined in the first step serve as labeled data in the second step, on which the attention-based Bi-LSTM model is trained.
Analysis of the relevant literature shows that ABSA makes heavy use of RNN models, which perform superbly; however, they exhibit weaknesses regarding position invariance and local pattern sensitivity, which called their performance into question. CNNs address this scenario, but modeling long-term dependencies and sequence information in turn hampers their success. Consequently, recent research has shifted towards model-fusion approaches that enhance feature identification and extraction [47,48]. This encouraged Zhu et al. [49] to present an aspect-level attention-based recurrent convolutional neural network (AARCNN) that extracts aspect-based sentiment from user-generated reviews and comments. Their approach combines the target information with an attention mechanism that lets the model concentrate on the exact targeted aspects: a Bi-LSTM provides the whole-sentence representation to a CNN, which extracts the most attended parts of the sentence as potential features along with their sentiments. Under this influence, Akhtar et al. [50] proposed an approach for aspect extraction and sentiment-polarity classification that uses a bi-directional long short-term memory (Bi-LSTM) with a CNN from a transfer learning perspective. Their Bi-LSTM learns the sequential patterns for predicting aspect terms in the provided review sentences, while the CNN acquires local features related to the identified aspect terms for SC. The two algorithms are used jointly, in a serial manner, to enhance the prediction rate on both tasks.
According to the literature, existing approaches have adopted model fusion but have mostly combined the models sequentially: one model receives the actual input while the other acquires the first model's output, which causes information loss. This situation motivates approaches in which both algorithms fetch the same input simultaneously and then combine their learnings to exploit their joint abilities. With this motivation, Guo et al. [25] proposed a hybrid parallel approach named CRAN, which comprises a CNN and a Bi-GRU. The approach is built on an attention mechanism whose main objective is to combine the outputs of the CNN and the GRU, highlighting the terms on which the contextual information of the whole sentence focuses. The sequential information gathered by the GRU identifies these valuable features and extracts them in light of the contextual information learned by the CNN, preserving the semantic information of the targeted sentence through the collaborative learning of both merged algorithms. The model performs well, but it lacks knowledge of contextual position information and PoS tags, and it excludes the sharing of hidden-layer information between the combined algorithms, which could enhance the identification of valuable attributes.
Moreover, the ABSA literature also contains contributions that integrate both strategies (parallel and sequential). Yu et al. [3] proposed ABLGCNN, which utilizes two parallel RNN architectures (an LSTM or a GRU) and joins each of them serially with a CNN. According to the literature, however, RNNs can capture global contextual information but cannot capture local features efficiently, which causes information loss in this approach. Although the model performs well, the serial combination of RNN and CNN makes it complex and loses beneficial information. In addition, the attention mechanism is applied only to the RNN output while the CNN outcome is ignored, and the integrated algorithms do not transfer contextual position information or hidden-layer information, both of which could enhance classification performance. Another hybrid approach, CNN_BiLSTM [51], performs SC through the parallel combination of a CNN and a BiLSTM: the CNN extracts local features while the BiLSTM captures global contextual information, and the combined features are passed to a softmax function for classification. Both the CNN and the BiLSTM take only word2vec embeddings as input; although the CNN comprises three input channels, the word2vec embedding is the only parameter provided. In addition, Zhang et al. [52] proposed a parallel approach that combines a multi-attention CNN with a Bi-GRU. Three inputs, namely an attention-oriented word vector, PoS information, and position information, are provided to the multi-attention CNN, whereas the Bi-GRU receives only the word embedding (without an attention mechanism) for acquiring the contextual semantic facts that determine sentence-oriented sentiment polarities. Their approach thus delivers inputs unevenly between the algorithms: the attention-oriented PoS and position vectors are provided only to the CNN and not to the Bi-GRU, which is the prime deficiency of the approach, and including these neglected features could further enhance its identification performance and sentiment prediction. In another contribution, Cheng et al. [53] proposed a further parallel procedure for text sentiment analysis composed of an attention-based MC-CNN and an attention-based Bi-GRU. Both algorithms use only attention-based word2vec embeddings as input, ignoring contextual position information, dependency-based relations, and even the sharing of hidden-layer information during classification; considering these features could enhance the performance and accuracy of the framework.
Traditional approaches fail to consider the influence of interrelated contextual words and the distance-based relationship between aspect terms and contextual phrases. Huang et al. [54] therefore proposed CPA-SA, which performs ABSA using aspect-specific contextual position information: their function adjusts the weight of contextual words according to the position of potential terms, alleviating the interference of terms on either side of conceivable terms when determining their polarities. However, this approach excludes the influence of syntactic and semantic relations. Past deep learning approaches have relied massively on either pre-trained language models or attention mechanisms that apply similar attention weights to the whole context without restriction, so Feng et al. [55] proposed an approach that combines attention with a masking mechanism: a threshold is imposed on the attention weights, keeping only the scores above it and removing the lower-scoring terms. This approach focuses only on word2vec-based knowledge, leaving the importance of contextual position information out of scope. Moreover, Liao et al. [56] proposed FAPN, a phrase-aware CNN-based fine-grained attention mechanism that captures word-level relations between an aspect and its context; however, their methodology focuses only on local contextual information and neglects the global contextual information.
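As a hedged sketch of the thresholded-attention idea in [55] (the threshold value and names below are illustrative assumptions, not the authors' settings), attention scores below a cutoff are zeroed out and the surviving weights renormalized.

```python
import torch

# Masked attention: zero out attention weights below a threshold, then
# renormalize so the surviving weights still sum to one.
def masked_attention(scores: torch.Tensor, threshold: float = 0.05):
    weights = torch.softmax(scores, dim=-1)        # ordinary attention weights
    kept = torch.where(weights >= threshold, weights,
                       torch.zeros_like(weights))  # drop low-scoring terms
    return kept / kept.sum(dim=-1, keepdim=True).clamp_min(1e-12)

scores = torch.tensor([[2.0, 0.1, -1.0, 1.5, -2.0]])
print(masked_attention(scores))                    # low-score terms get weight 0
```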
Recapitulating the above discussion, we conclude that neither CNN nor RNN performs extraordinarily when implemented individually, and their sequential combination loses valuable information. In addition, in real-world scenarios a simply designed model consistently performs well and proves more beneficial than a complicated one. In the relevant literature, the MC-BiGRU model has conceivably never been combined with other algorithms in parallel, which motivates Att-JM to merge MC-BiGRU and MC-CNN in one model; this enhances performance while keeping the architecture simple. According to the relevant ABSA literature and to the best of our knowledge, Att-JM is the first technique that integrates an Att-MC-BiGRU with an Att-MC-CNN and shares their hidden layers for transfer learning. These distinctions improve aspect extraction and sentiment prediction from textual reviews and constitute the main novelty of this methodology. Moreover, the attention mechanisms and contextual position information enable the accurate identification of aspects and classification of their sentiments. Furthermore, the proposed approach distributes the input parameters, such as positional information and word2vec embeddings, evenly between the algorithms, so each exploits its full abilities while learning valuable features and then combines these learnings during the identification and extraction of the targeted aspects and their corresponding sentiments, which distinguishes Att-JM from existing approaches.