EmoBERTa-X: Advanced Emotion Classifier with Multi-Head Attention and DES for Multilabel Emotion Classification

Labib, Farah Hassan; Elagamy, Mazen; Saleh, Sherine Nagy

doi:10.3390/bdcc9020048


by Farah Hassan Labib *, Mazen Elagamy and Sherine Nagy Saleh *

Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria 1029, Egypt

* Authors to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(2), 48; https://doi.org/10.3390/bdcc9020048

Submission received: 17 November 2024 / Revised: 5 February 2025 / Accepted: 17 February 2025 / Published: 19 February 2025

(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)


Abstract:

The rising prevalence of social media platforms has turned them into huge, rich repositories of human emotions. Understanding and categorizing human emotion from social media content is of fundamental importance for many reasons, such as improving user experience, monitoring public sentiment, supporting mental health, and enhancing focused marketing strategies. However, social media text is often unstructured and ambiguous; hence, extracting meaningful emotional information is difficult, and effective emotion classification needs advanced techniques. This article proposes a novel model, EmoBERTa-X, to enhance performance in multilabel emotion classification, particularly in informal and ambiguous social media texts. Attention mechanisms combined with ensemble learning, supported by preprocessing steps, help mitigate issues such as class imbalance in the dataset, ambiguity in short texts, and the inherent complexities of multilabel classification. The experimental results on the GoEmotions dataset indicate that EmoBERTa-X outperforms state-of-the-art models on fine-grained emotion-detection tasks in social media expressions, with an accuracy increase of 4.32% over some popular approaches.

Keywords:

emotion classification; multilabel classification; multi-head attention; DES

1. Introduction

In today’s hyper-connected world, social media platforms like Facebook, Instagram, Reddit, and Twitter have become emotional microphones where people freely express their happiness, dissatisfaction, and various other emotions. Given the enormous amount of available user-generated content, there is a growing need to detect and understand these emotions reliably through public sentiment analysis [1]. Emotion classification is thus one of the important tasks of sentiment analysis, helping companies and academics interpret emotional cues. It finds applications in marketing and customer relationship domains, where identifying consumers’ emotional responses can help develop better strategies [2,3].

Emotion classification also finds important applications in real-time services such as crisis intervention, where timely detection of a distress signal from social media or any other communication platform helps mental health professionals and emergency responders reach out to the people affected immediately, thus preventing further escalation. For example, studies have shown that emotion-classification methodologies go a long way in visualizing emotional trends that may help authorities identify people in crises [4]. In addition, high-accuracy, real-time emotion detection has the potential to revolutionize human-computer interactions and offer a proactive approach toward mental health care by offering timely interventions and support for people affected by traumatic events [5].

Additionally, emotion detection is becoming increasingly important in the field of mental health, where it helps identify psychological discomfort early and enables timely treatment. Using artificial intelligence (AI) and machine learning (ML), combined with natural language processing (NLP) techniques, researchers today can analyze massive amounts of social media data to detect, with reasonable accuracy, indicators of a mental health crisis such as depressive episodes and suicidal thoughts [6]. An AI-driven model, for instance, can recognize emotional distress indicators up to a week before human specialists do, providing a crucial window of time for treatment [7]. Emotion-detection technologies make real-time monitoring and early crisis diagnosis possible, both of which are essential parts of modern mental health therapy.

Traditional approaches have relied on rule-based and lexicon-based methods, which depend on pre-defined dictionaries of words related to particular emotions [8,9]. These early models were simple and interpretable; hence, they were successful in the initial phases of sentiment analysis. However, their reliance on static dictionaries limited their ability to capture the context-dependent meaning of words. For instance, the word “cool” might refer to temperature in a post about the weather, but in social media slang, it expresses approval or admiration [10]. Lexicon-based models also had a hard time keeping up with the informal and constantly changing language used on social media platforms [10,11]. Statistical models based on feature learning, such as Support Vector Machines (SVMs) and Naive Bayes, overcame these drawbacks by training on labeled data instead of relying on pre-built static dictionaries [12]. These models generalize better but still depend on manual feature engineering and, hence, are labor-intensive and not effective for complex multilabel emotion-classification tasks [13,14].

Sarcastic tweets such as “Oh great, just another Monday” are much harder to classify because they require understanding the intended sentiment beyond explicit keywords. Sarcasm detection matters for sentiment analysis because sarcastic comments often use positive words to convey negative or undesirable meanings, making it difficult for traditional sentiment analysis algorithms to interpret the intended sentiment correctly [15]. To address these challenges, researchers have explored various approaches, for instance, pattern-based methods that identify linguistic patterns commonly associated with sarcasm. These methods have achieved remarkable improvements in the accuracy of detecting sarcastic tweets [16].

Deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, brought major improvements in emotion classification. CNNs specialize in detecting local relationships between words, making them excellent at spotting patterns within text [17]. LSTMs, a particular kind of RNN, can model long-term dependencies, processing sequential data and capturing emotional patterns over larger texts [18]. Despite their success over classical approaches, CNNs and LSTMs have limitations. CNNs are effective at recognizing local patterns but fail to catch the larger context of a phrase because they operate on narrow text windows [19]. LSTMs are better suited to long-term dependencies but suffer from vanishing gradients during training, especially with longer sequences [20]. On top of that, CNNs and LSTMs require huge volumes of labeled data, making them less useful in low-resource contexts [21].

Presently, transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) and the robustly optimized BERT approach (RoBERTa) have raised the emotion-classification standards [22]. To provide a richer understanding of emotional expressions, these models rely on self-attention mechanisms that collect context from both the left and right of a target word [23]. Transformers have shown exceptional effectiveness across a variety of NLP tasks, including emotion classification, because they can be pre-trained on vast corpora and fine-tuned on specific tasks [22,23].

Despite their strengths, transformers have several limitations. For instance, the short and unclear texts frequently found on social media are a challenge for them [23]. Additionally, transformers struggle to generalize in situations involving informal language or slang, which differs among cultures and regions. For instance, “This party is so lit I’m literally dead” expresses excitement and happiness, yet a transformer might interpret it as something completely different because of the words “dead” and “lit”. Expressions of this nature have long eluded traditional sentiment analysis models, since such informal language characteristics degrade their performance on social media text [24]. Furthermore, cultural awareness provides a strong basis for sentiment analysis: language use varies across speech communities and demands models that apply to various linguistic contexts [25].

Furthermore, transformers are challenging to use in low-resource environments because they require significant computational resources for both training and inference [22]. In addition, despite their improvements in emotion classification, transformers still have difficulties with multilabel classification, that is, the categorization of multiple emotions present in a single text passage [22]. Finally, hybrid models have been suggested as a solution to the limitations of individual models. By combining the benefits of several models, such as CNNs and LSTMs, they improve performance in emotion classification [26]. However, even with these gains, they continue to face challenges with multilabel classification and informal language, especially on social media [27].

Although emotion classification has evolved, it still struggles with short sentences, informal language, and multilabel classification, all of which are very common on social media platforms. Deep learning models, including transformer-based ones, require extensive processing resources and large datasets, and still show poor results on multilabel and informal text. While traditional models struggle to generalize, transformer-based models like RoBERTa have shown improvements in contextual understanding. Recent advances in ensemble learning approaches for multilabel classification have shown significant improvements in handling highly unbalanced datasets, particularly in social media applications such as vaccine-related discourse classification [28].

In this article, we propose EmoBERTa-X, an enhanced dynamic ensemble selection (DES) framework integrated with a modified attention mechanism to better handle multilabel emotion classification. DES has recently emerged as a promising approach for adapting classification decisions to the varying complexity of the input. DES frameworks offer adaptability by selecting the most competent classifiers for a given instance based on their historical performance and context. Research suggests that DES can enhance the selection of relevant features for tasks related to dynamic emotion recognition [29]. However, these approaches generally fail in emotion classification, especially on social media text, which is typically short, ambiguous, and informal. Social media platforms also introduce dynamic challenges, such as concept drift and the evolution of language patterns, which raise the demand for robust and adaptive classification techniques [30].

The existing DES-based approaches for emotion classification are limited by their reliance on static feature selection strategies, which fail to capture the dynamic, multilabel nature of emotions expressed in overlapping and context-dependent ways [31]. Similarly, while transformers like BERT and RoBERTa have achieved state-of-the-art results in NLP tasks, their performance suffers when dealing with informal language, class imbalances, and the complex multilabel structures often present in social media data [32].

To address these challenges, EmoBERTa-X combines an enhanced DES framework with a multi-head attention mechanism to create a robust solution for multilabel emotion classification.

Together, dynamic ensemble selection and multi-head attention enhance the model’s attention to emotional cues and allow DES to adapt dynamically to shifting input complexities. Our hybrid approach effectively addresses the challenges posed by informal text and multilabel categorization, providing the best solution when compared to simpler alternatives. For example, a study on multilabel text classification proposed a model that fully exploits the semantic information inherent in labels utilizing BERT and a label attention mechanism to show how well attention mechanisms handle challenging classification problems [33].

In addition, studies of label attention and correlation networks in multilabel text classification have shown that attention mechanisms can efficiently capture label relations, which motivates further integration of attention mechanisms into multilabel classification frameworks [34].

The major contributions of this work are:

Enhanced DES Framework: We propose enhancements to the internal structure of the DES framework that optimize its handling of the complexities typical of the short and ambiguous texts found on social media.
Integration with a Multi-Head Attention Mechanism: We extend the attention mechanism to sharpen the model’s focus on relevant emotional cues within the text, which is especially useful in the multilabel setting.
Advanced Preprocessing: We introduce new preprocessing steps, such as abbreviation expansion and context-sensitive refinement of embeddings, which cope better with the informality of the language.

Taken together, these enhancements contribute to gains in emotion classification and help the model cope better with informal and dynamic social media contexts. The rest of this article is structured as follows: Section 2 introduces the applied methodology, including the proposed model and the integration of the DES framework; Section 3 describes the dataset and its challenges, in addition to the set of experiments and their results; Section 4 concludes the article.

2. Applied Methodology

This section describes the architecture and methodology of the proposed model, EmoBERTa-X, shown in Figure 1. It is an advanced multilabel emotion-classification system that combines the DES [35] framework with an enhanced RoBERTa model embedding a multi-head attention mechanism. EmoBERTa-X is designed to deal with the complex problems inherent in unstructured, ambiguous, and informal social media text, and to perform well across diverse contexts and label distributions.

2.1. Advanced Preprocessing Techniques

Given that social media text is unstructured and generally made up of abbreviations, slang, informal language, and short sequences of words, EmoBERTa-X uses an extensive preprocessing pipeline to make this text as useful as possible for emotion classification. The purpose is to extend the linguistic and semantic grasp of the model by ensuring that incoming data are converted to a form from which emotions can be detected effectively.

2.1.1. Abbreviation and Slang Expansion Module

EmoBERTa-X embeds the Abbreviation and Slang Expansion Module (ASEM), which expands the informal language used on social media venues such as Twitter or Reddit into normalized English. It uses a hand-curated dictionary along with a dynamic, context-based expansion algorithm.

The dictionary starts from common abbreviations, such as “lol” for “laugh out loud”, and slang terms, such as “omg” for “oh my god”. Each abbreviation is mapped to its expanded form, providing a reference for the module. The context-based expansion algorithm dynamically applies these mappings at the time of tokenization. When an abbreviation or slang term is encountered, the module considers its context in the sentence to choose an appropriate expansion. This context-sensitive approach ensures that abbreviations with multiple meanings are expanded according to usage. For example, it can take an informal sentence like “brb, gotta go 2nite, lol” and expand it into “be right back, got to go tonight, laugh out loud”. This example shows how the module transforms social media vernacular into standardized writing, improving readability and sentiment analysis accuracy.
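The dictionary lookup at the heart of this module can be illustrated with a minimal sketch. The dictionary entries and the function name `expand_abbreviations` are hypothetical, and the sketch performs a context-free lookup only; the context-sensitive disambiguation described above is not reproduced here.

```python
import re

# Hypothetical hand-curated dictionary; the entries are illustrative only.
EXPANSIONS = {
    "brb": "be right back",
    "gotta": "got to",
    "2nite": "tonight",
    "lol": "laugh out loud",
    "omg": "oh my god",
}


def expand_abbreviations(text, expansions=EXPANSIONS):
    """Replace each known abbreviation or slang token with its expansion.

    Punctuation and spacing are left untouched because only word-like
    tokens are substituted.
    """
    def lookup(match):
        token = match.group(0)
        return expansions.get(token.lower(), token)

    return re.sub(r"\w+", lookup, text)


print(expand_abbreviations("brb, gotta go 2nite, lol"))
```

A context-aware variant would need a rule or model that picks among several candidate expansions per token; that step is left out because the text does not specify how it is done.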

ASEM is crucial for EmoBERTa-X, as social media content is highly variable and informal. This preprocessing step allows the model to handle inputs that do not fit the mold of standard language.

2.1.2. Context-Sensitive Embedding Refinement

EmoBERTa-X adopts the Context-Sensitive Embedding Refinement (CSER) mechanism to align the RoBERTa embeddings with the diverse and changing patterns of social media language. This post-adjustment process recalibrates RoBERTa-generated embeddings to be sensitive to informal and contextually specific variations of language.

The refinement transformation is learned during training and helps align the embeddings with sentiment-specific and contextually relevant features. By recalibrating the embeddings, CSER improves the model’s reading of the informal and ambiguous texts that frequently occur on social media.

2.1.3. Token Variation and Noise Reduction

Social media posts are prone to typos, informal spelling variants, and other forms of noise. EmoBERTa-X therefore applies a Token Variation and Noise Reduction (TVNR) strategy in its preprocessing pipeline. TVNR makes the model robust against the linguistic diversity inherent in user-generated content.

Synonym Replacement and Token Normalization: TVNR incorporates synonym replacement and token normalization rules. Variants in spelling and loose usages like “gr8” for “great” are replaced with their normalized forms.
Augmentation Strategies: TVNR also performs token-level augmentations like random insertion and synonym replacement during training to increase its robustness. These augmentations will allow the model to generalize better in scenarios where there are token variations it has not yet seen at test time.
Noise Filtering: The module further reduces the impact of typos and contextually irrelevant characters, such as inappropriate punctuation or emojis, by filtering them out or replacing them with contextually appropriate tokens.

By reducing noise and ensuring consistency in token representation, TVNR enhances the model’s ability to accurately identify multiple emotions present within a single text, therefore improving multilabel categorization performance.
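The normalization and noise-filtering steps can be sketched as follows. This is an illustrative assumption rather than the module's actual rules: the variant table, the choice to collapse character runs of three or more, and the function name `normalize_tokens` are all hypothetical, and the training-time augmentations are not shown.

```python
import re

# Hypothetical variant table; the entries are illustrative only.
VARIANTS = {"gr8": "great", "thx": "thanks", "u": "you"}

# Runs of 3+ identical letters or of "!", "?", "." (digits are left alone
# so that numbers such as "1000" are not altered).
REPEAT_RE = re.compile(r"([^\W\d_]|[!?.])\1{2,}")


def normalize_tokens(text, variants=VARIANTS):
    """Normalize spelling variants and reduce simple character-level noise."""
    # Collapse exaggerated repetitions to two characters ("soooo" -> "soo").
    text = REPEAT_RE.sub(r"\1\1", text)
    # Map known informal spellings to their normalized forms.
    text = re.sub(r"\w+", lambda m: variants.get(m.group(0).lower(), m.group(0)), text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_tokens("soooo gr8!!!!   thx 1000"))
```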

2.1.4. Handling Text Length and Padding (HTLP)

Given that EmoBERTa-X processes social media posts of various lengths, mechanisms for balancing the length of texts or padding have also been added to the preprocessing steps.

Each post has been truncated or padded to a fixed length, such as 128 tokens, using a RoBERTa tokenizer. This step also results in uniformity of the input size, which is desirable for neural networks since it allows batch processing. This also involves the creation of attention masks, marking which tokens are actual input data and which are just padding. In this way, the multi-head attention mechanism focuses only on the relevant portions of the sequence.
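A minimal sketch of this step, operating on token ids that have already been produced by a tokenizer, is shown below. The function name is hypothetical, and the padding id of 1 is an assumption that matches the usual RoBERTa vocabulary; the handling of special tokens during truncation is omitted.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=1):
    """Return (input_ids, attention_mask), each of length max_len.

    The mask holds 1 for real tokens and 0 for padding, so that the
    attention mechanism can ignore padded positions.
    """
    ids = list(token_ids)[:max_len]          # truncate long posts
    n_pad = max_len - len(ids)               # how many pad tokens to add
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask


ids, mask = pad_or_truncate([0, 100, 200, 2], max_len=6)
print(ids, mask)
```

In practice a tokenizer library typically produces both outputs in one call; the sketch only makes the relationship between the padded ids and the mask explicit.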

The advanced preprocessing pipeline in EmoBERTa-X ensures that the informal and diverse nature of social media text can be effectively captured in the model. Expansion of abbreviations, refinement of embedding, and reduction of noise enhance input quality that can help make it compatible with a sophisticated multi-head attention mechanism and dynamic ensemble framework. Essentially, these preprocessing techniques are useful for optimizing model performance on a wide range of emotion-classification scenarios.

2.2. Comprehensive Model Framework

EmoBERTa-X is a complex and adaptive architecture that leverages the DES framework in multiple tiers. This distinguishes EmoBERTa-X from traditional static ensemble models: it dynamically builds an optimal subset of classifiers according to the features of each input instance, achieving higher flexibility, adaptability, and precision.

The core of the DES framework is a meta-learning component, a higher-order learning strategy responsible for optimizing how the base models in the ensemble are configured. Instead of applying a fixed set of classifiers uniformly to all new inputs, the meta-learning component adapts how the models are combined, given contextual and historical information about the input instances. It continuously learns the optimal weighting and classifier selection from performance metrics and input features, so as to maximize classification accuracy in real time.

Given an input instance x, the meta-learning component computes the optimal subset of classifiers $C^{*}$ using the following function [36]:

$C^{*} = \arg\max_{S \subseteq C} \sum_{c \in S} α_{c} \cdot f_{c} (x)$

(1)

where:

$C$ denotes the complete set of available classifiers,
$\arg\max_{S \subseteq C}$ selects the subset $S$ of classifiers within the ensemble $C$ that maximizes the weighted sum; this maximizing subset is $C^{*}$,
$α_{c}$ is a weight parameter assigned to each classifier c, learned through the meta-learning process to reflect its effectiveness in the current context,
$f_{c} (x)$ is the output from classifier c for the input instance x.

The meta-learning component iteratively updates these weights ($α_{c}$), guided by a meta-objective function that may combine several evaluation metrics such as micro, macro, and weighted F1-scores. The component analyzes past performance and the nature of recent data samples to update the weights dynamically, improving the relevance and contribution of each classifier. This enables finer tuning of the ensemble and higher precision, because classifier selection is aligned with the fine details of each input.

Early stopping driven by the F1-score was also employed when training the meta-learning component to prevent overfitting: performance on the validation data is monitored, and training stops automatically once it has stabilized. This helps the ensemble generalize across a wide variety of social media data.
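The stopping rule can be sketched as a patience-based check on the validation F1-score. The values of `patience` and `min_delta` and the function name are hypothetical, since the text does not state the exact criterion for deciding that performance has stabilized.

```python
def early_stopping_epoch(val_f1_history, patience=3, min_delta=1e-4):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation F1-score has failed to improve by
    more than min_delta for `patience` consecutive epochs.
    """
    best = float("-inf")
    stale_epochs = 0
    for epoch, f1 in enumerate(val_f1_history):
        if f1 > best + min_delta:
            best = f1              # new best score: reset the counter
            stale_epochs = 0
        else:
            stale_epochs += 1      # no meaningful improvement this epoch
            if stale_epochs >= patience:
                return epoch
    return None


print(early_stopping_epoch([0.40, 0.45, 0.47, 0.47, 0.469, 0.47]))
```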

2.3. Integration of Multi-Head Attention with EmoBERTa-X

EmoBERTa-X integrates a sophisticated multi-head attention mechanism with the RoBERTa base architecture [37], extending its capability toward multilabel emotion classification of social media text. The proposed model attends to multiple linguistic, semantic, and syntactic features of the input text at the same time, which is crucial for correctly identifying the often overlapping emotional signals present in short, informal pieces of text. EmoBERTa-X updates RoBERTa’s internal architecture by embedding h parallel attention heads within its layers, as depicted in Figure 2. The model takes RoBERTa’s last hidden states and further enhances them with the attention mechanism to better classify emotional expression.

Attention Mechanism Configuration: The multi-head attention is defined with an embedding dimension that matches RoBERTa’s hidden size. The h parallel attention heads work simultaneously, each computing attention scores for different parts of the input sequence. This parallelism helps the model capture diverse linguistic patterns along with the emotional cues present in social media text.
Mathematical Formulation: The attention mechanism computes the scores using the query, key, and value matrices derived from the hidden states output by RoBERTa. For each attention head i, the output $A_{i}$ is computed as [38]:

$A_{i} = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right) V_{i}$

(2)

where:
- $A_{i}$ represents the output of the i-th attention head after applying the softmax function to normalize the scores,
- The softmax function normalizes the attention scores, ensuring that they represent the relevance of each token in the sequence relative to the others,
- $Q_{i}$ , $K_{i}$ , and $V_{i}$ are the query, key, and value projections for the i-th attention head,
- $d_{k}$ is the dimension of the key vectors.
Combining Attention Heads: The outputs from all attention heads are concatenated to provide a unified representation. The concatenation gathers the evidence provided by each head, so the model can jointly combine the various emotional and contextual features of the input sequence. This combined output then undergoes a linear transformation by a weight matrix $W_{o}$ to align it with the original embedding dimension [38].

$O = W_{o} \cdot \mathrm{concat} (A_{1}, A_{2}, \dots, A_{h})$

(3)

This transformation combines the various pieces of information that each head captures into a single representation that carries syntactic and semantic information in an integrated way.
Dropout and Pooling Operations: A dropout layer is applied to the output of the multi-head attention mechanism as a regularization technique to prevent overfitting and improve the generalizability of the model. It works by randomly disabling a fraction of the attention output units during training, ensuring that no single pathway is relied upon too heavily. Afterward, the model performs average pooling along the sequence dimension, averaging the attended token representations into one pooled vector that summarizes the emotional signals captured by the attention heads and is suitable for classification.
Final Classification Layer: The pooled output acts as a compact and enriched representation of the sequence and is fed into a fully connected classification layer. This layer applies a linear transformation that maps the pooled vector to the emotion labels, so the model emits a prediction based on the attended information. Using BCEWithLogitsLoss as the loss function suits multilabel classification, since each emotion label is treated as a separate binary classification.
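Equations (2) and (3) can be traced on small matrices with the following sketch, written with plain Python lists so that every step is visible. The per-head projection matrices `w_q`, `w_k`, `w_v` and all function names are hypothetical, rows are treated as token vectors (so the output projection is applied on the right), and padding masks and dropout are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(hidden, w_q, w_k, w_v):
    """Equation (2): softmax(Q K^T / sqrt(d_k)) V for a single head."""
    q, k, v = matmul(hidden, w_q), matmul(hidden, w_k), matmul(hidden, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return matmul(weights, v)


def multi_head_attention(hidden, heads, w_o):
    """Equation (3): concatenate the head outputs, then project with W_o."""
    outputs = [attention_head(hidden, *params) for params in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(hidden))]
    return matmul(concat, w_o)


# Two identical tokens, two heads with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [1.0, 0.0]]
w_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
result = multi_head_attention(hidden, [(identity,) * 3, (identity,) * 3], w_o)
print(result)
```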

With multiple attention heads, the model broadens its focus to capture cues for complex emotions and context that might be missed with a single attention pathway. This is particularly effective for multilabel emotion classification, since it allows the model to identify and separate the often overlapping emotional cues of short social media posts. Dropout, pooling, and classification then translate the insights from the attention mechanism into predictions, enhancing the model’s generalization capability across a wide range of social media environments.
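The path from attended token vectors to a multilabel loss can be sketched as below. The sketch assumes that average pooling skips padded positions, which the text does not state, and it omits dropout; the function names are hypothetical. The loss follows the standard numerically stable form of binary cross-entropy on logits, averaged over labels.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def linear(vec, weight, bias):
    """Fully connected layer: one logit per emotion label."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, treating each label independently."""
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
logits = linear(pooled, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 1.0])
print(pooled, logits, bce_with_logits(logits, [1.0, 0.0, 1.0]))
```

Because every label contributes its own sigmoid-based term, several emotions can be predicted for the same post, which is what distinguishes this setup from a softmax over mutually exclusive classes.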

2.4. Advanced Modification of the DES Framework

The Advanced DES framework dynamically chooses the best EmoBERTa-X classifier(s) depending on the nature of each input instance. This advanced modification ensures that the model will accommodate different types of social media text, accounting for variations in emotional complexity, length of the text, and ambiguity. By refining the internal structure of the DES framework, the proposed model improves performance on numerous dimensions of multilabel emotion classification and optimizes its response to diverse and informal data. In the proposed approach, four models were trained on different parts of the training set, introducing diversity into the ensemble. This diversity ensures that each model learns distinct patterns and generalizes differently, increasing the likelihood that at least one model in the ensemble will perform well for any given test instance.

2.4.1. Context-Sensitive Classifier Selection

The DES framework of EmoBERTa-X dynamically selects an optimal subset of classifiers from the ensemble at any given time, using context-sensitive evaluation metrics, i.e., it first performs an evaluation of the competence of each model on the present input instance. For that purpose, it relies on Hamming loss as its primary measure of classifier effectiveness, as shown in Figure 3.

For any classifier c and given input x, the competence score of this classifier is computed as [39]:

$\mathrm{Competence} (c, x) = 1 - \mathrm{Hamming\ Loss} (c, x)$

(4)

$\mathrm{Hamming\ Loss} = \frac{1}{N \times L} \sum_{i = 1}^{N} \sum_{j = 1}^{L} 1 (Y_{i j} \neq {\hat{Y}}_{i j})$

(5)

where:

N is the number of samples.
L is the number of labels.
$Y_{i j}$ and ${\hat{Y}}_{i j}$ are the true and predicted labels (1 or 0) for the j-th label of the i-th sample.
$1 (\cdot)$ is an indicator function that is 1 if the argument is true and 0 otherwise.
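Equations (4) and (5) translate directly into a few lines of code. The sketch below assumes that the labeled samples over which a classifier is evaluated for a given input are already available; how those samples are chosen is not specified in this section, and the function names are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Equation (5): fraction of label positions that are predicted wrongly.

    y_true and y_pred are lists of N binary label vectors of length L.
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    mismatches = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return mismatches / (n_samples * n_labels)


def competence(y_true, y_pred):
    """Equation (4): competence is one minus the Hamming loss."""
    return 1.0 - hamming_loss(y_true, y_pred)


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
print(hamming_loss(y_true, y_pred), competence(y_true, y_pred))
```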

Once the competence scores have been computed, classifiers with scores at or above a pre-specified selection threshold $τ$ are selected for the final ensemble pool. The threshold $τ$ ensures that only classifiers with acceptable performance are selected, reducing the noise introduced by poorly performing models.

$C^{*} = \{ c \in C \mid \mathrm{Competence} (c, x) \geq τ \}$

(6)

Then, dynamic weights are computed for each selected classifier according to the estimated competence scores, so that classifiers with higher scores have a greater influence on the final prediction than lower-scoring ones. The following weighting scheme has been adopted:

$\mathrm{Weight} (c) = \frac{\mathrm{Competence} (c, x)}{\sum_{k \in C^{*}} \mathrm{Competence} (k, x)}$

(7)

The weighted aggregation strategy produces the final prediction, in which the outputs of the selected classifiers are combined with their assigned weights. It can be a weighted majority vote or an averaging of probabilistic outputs. This dynamic adjustment in ensemble composition and weight allocation allows EmoBERTa-X to adapt better to different input instances, thereby improving the robustness and accuracy of classification.

This dynamic selection and weighting mechanism enables the framework to focus on the most relevant classifiers for any given input instance, optimizing the overall performance while reducing misclassification errors.
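Threshold-based selection and competence-normalized weighting, as in Equations (6) and (7), can be sketched as follows. The threshold value, the classifier names, and the function name are hypothetical.

```python
def select_and_weight(competences, tau=0.5):
    """Keep classifiers whose competence is at least tau and normalize weights.

    competences maps a classifier name to its competence score for the
    current input; the returned weights sum to 1 over the selected set.
    """
    selected = {name: score for name, score in competences.items() if score >= tau}
    total = sum(selected.values())
    if total == 0:
        return {}                  # nothing selected (or all scores are zero)
    return {name: score / total for name, score in selected.items()}


weights = select_and_weight({"m1": 0.9, "m2": 0.6, "m3": 0.3}, tau=0.5)
print(weights)
```

A fallback for the case where no classifier passes the threshold, such as using the single best model, would be needed in practice; the text does not say which fallback is used.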

2.4.2. Multi-Contextual Selector Module

The selection mechanism is further extended by the introduction of the Multi-Contextual Selector Module (MCSM), which performs the selection based on a weighted combination of performance metrics, input characteristics, and contextual features of the data.

MCSM considers the following factors influencing the complexity of emotion detection:

Text Length: longer texts may provide more emotional context, while shorter texts require more focused attention.
Ambiguity: The use of informal language, slang, and abbreviations normally creates ambiguities in depicting emotions. The MCSM evaluates the degree of ambiguity using a custom Ambiguity Coefficient (AC) [40]:

$AC (x) = \frac{{Length}_{sentence}}{{Complexity}_{words} + ϵ}$

(8)

where:
- ${Length}_{sentence}$ represents the number of words in the input text x,
- ${Complexity}_{words}$ measures the average semantic complexity of the words based on their embeddings and contextual information,
- $ϵ$ is a small constant to avoid division by zero.

This AC is used by MCSM to update classifier weights, giving a higher weight to classifiers that have performed well in the past on ambiguous or informal text. This refinement ensures that the DES framework adapts dynamically to changing conditions in the incoming feed and improves its classification accuracy.
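The structure of Equation (8) can be sketched as below. The per-word complexity measure is passed in as a function because the text does not define how it is derived from embeddings; the word-length proxy used in the example is purely a placeholder and not the measure used by MCSM.

```python
def ambiguity_coefficient(text, word_complexity, eps=1e-8):
    """Equation (8): sentence length divided by average word complexity.

    word_complexity is a hypothetical callable that returns a
    non-negative complexity score for a single word.
    """
    words = text.split()
    if not words:
        return 0.0
    average_complexity = sum(word_complexity(w) for w in words) / len(words)
    return len(words) / (average_complexity + eps)


# Placeholder proxy: word length stands in for semantic complexity.
print(ambiguity_coefficient("so lit", word_complexity=len))
```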

2.4.3. Weighted Balanced Sampling

An important addition to the training phase of DES was Weighted Balanced Sampling (WBS). This technique is highly relevant under class imbalance; without it, predictions would be biased towards the more frequent classes.

Balanced Batch Construction: Each training batch is formed by sampling examples so that the class distribution within the batch is much more balanced. Underrepresented classes receive higher weights, and therefore a higher probability of being sampled, so they appear more frequently in training batches.
Model Diversity and Learning: Applying weighted balanced sampling across all constituent models of the ensemble means that the DES framework capitalizes on models that have learned to classify both common and rare classes correctly. This introduces a positive combination effect in the classifier pool.
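One common way to realize this kind of sampling is to weight each training example by the inverse frequency of its rarest label and then draw batches with those weights. The sketch below follows that idea; the weighting rule, the function names, and the toy labels are assumptions for illustration and not necessarily the scheme used for EmoBERTa-X.

```python
import random
from collections import Counter
from typing import List, Sequence, Set


def sample_weights(label_sets: Sequence[Set[str]]) -> List[float]:
    """Weight each example by the inverse frequency of its rarest label.

    Assumes every example carries at least one label.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    return [max(1.0 / counts[label] for label in labels) for labels in label_sets]


def draw_batch(label_sets: Sequence[Set[str]], batch_size: int, rng: random.Random) -> List[int]:
    """Sample example indices with replacement, favoring rare classes."""
    weights = sample_weights(label_sets)
    return rng.choices(range(len(label_sets)), weights=weights, k=batch_size)


# Toy data: "joy" is frequent, "disgust" and "surprise" are rare.
labels = [{"joy"}, {"joy"}, {"joy"}, {"joy", "surprise"}, {"disgust"}]
weights = sample_weights(labels)
batch = draw_batch(labels, batch_size=8, rng=random.Random(0))
```

Here the three joy-only examples each get weight 0.25, while the two examples containing a rare label get weight 1.0, so rare classes are drawn four times as often per example.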

2.4.4. Competence-Based Weighting and Aggregation

Once the competence scores are computed for each example, the best-performing models are selected. Experiments showed that selecting only the top two models gives the best trade-off between performance and complexity, ensuring that final predictions are generated by the models best suited to the particular case in question.

DES uses competence-based weighting to aggregate the predictions of the selected models. During evaluation, the weighting is based on the relative competence of each classifier for a particular test instance. The final prediction is computed as [41]:

$\hat{Y} = \frac{\sum_{c \in C} w_{c} \cdot \hat{Y}_{c}}{\sum_{c \in C} w_{c}}$

(9)

where:

${\hat{Y}}_{c}$ represents the prediction of classifier c,
$w_{c}$ is the competence-based weight assigned to classifier c,
$\hat{Y}$ is the final prediction obtained by averaging the weighted outputs of the classifiers.

This competence-based aggregation gives models with higher competence scores a greater influence on the final prediction, contributing to an accurate and contextually aware classification of emotions.

2.5. Model Evaluation

Performance was evaluated using several multilabel classification metrics: accuracy, Hamming loss (previously presented in Equation (5)), and F1-scores (micro, macro, and weighted) [39]. Together, these provide a full review of how well the model assigns the multiple emotions associated with each instance and how robust it is across classes and labels.

In multilabel classification, accuracy is averaged over the proportion of correctly predicted labels for each sample. For any instance i, this metric measures the intersection between the true labels $Y_{i}$ and the predicted labels $\hat{Y}_{i}$, divided by the union of these labels, as calculated in Equation (10). This formulation allows for partial matches when multiple labels are assigned to one sample, which is often the case in emotion classification. Here, N represents the number of samples in the dataset. The metric provides insight into overall performance while also crediting partially correct predictions.

$\text{Accuracy} = \frac{1}{N} \sum_{i = 1}^{N} \frac{| Y_{i} \cap \hat{Y}_{i} |}{| Y_{i} \cup \hat{Y}_{i} |}$

(10)
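A minimal sketch of Equation (10) using Python sets is given below. The function name and the toy labels are illustrative, and the handling of a sample with no true and no predicted labels is an assumption, since the equation leaves that case undefined.

```python
from typing import Sequence, Set


def multilabel_accuracy(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Equation (10): mean over samples of |Y ∩ Ŷ| / |Y ∪ Ŷ|."""
    scores = []
    for true, pred in zip(y_true, y_pred):
        union = true | pred
        # Assumption: an empty truth with an empty prediction counts as a perfect match.
        scores.append(len(true & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


y_true = [{"joy", "surprise"}, {"anger"}]
y_pred = [{"joy"}, {"anger"}]
acc = multilabel_accuracy(y_true, y_pred)  # (1/2 + 1/1) / 2
```

The first sample is only half right (one of two labels recovered), so the score is 0.75 rather than the 0.5 that exact-match accuracy would give.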

Hamming loss (Equation (5)) quantifies how often incorrect labels are assigned relative to the total number of labels across all samples. For a given sample i and label j, this metric checks whether the predicted label $\hat{Y}_{ij}$ matches the true label $Y_{ij}$. The indicator function $1(\cdot)$ is 1 in the case of a mismatch and 0 in the case of a match. Such an error rate is informative in multilabel classification, where one sample can have more than one correct label. A lower Hamming loss thus indicates fewer misclassifications by the model over the labels, which is vital for emotion detection.
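The description above corresponds to the usual normalized form of Hamming loss: the number of mismatched label slots divided by the number of samples times the number of labels. The sketch below assumes Equation (5) takes this standard form and uses binary indicator vectors; names and data are illustrative.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of (sample, label) slots where prediction and truth disagree."""
    mismatches = sum(
        int(t != p)
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    total = len(y_true) * len(y_true[0])  # N samples times L labels
    return mismatches / total


# Two samples, four labels each; one mismatch per sample.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]
loss = hamming_loss(y_true, y_pred)  # 2 mismatches out of 8 slots
```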

The F1-score in multilabel emotion classification can be calculated in several ways, depending on which aspects of model performance are to be captured. The micro F1-score sums the true positives, false positives, and false negatives over all labels and then calculates a single F1-score from these totals. Since this method treats every instance of a label equally, it is particularly useful for imbalanced datasets in which certain emotions occur more often than others. The macro F1-score, in contrast, computes the F1-score of each label independently and averages the results, so that each label counts equally. This helps in assessing the model across both common and rare emotions and in singling out weaknesses in detecting infrequent labels. Finally, the weighted F1-score combines both elements: it calculates the F1-score for every label and weights it by the number of true instances of that label. It is therefore a balanced metric that reflects the general performance of the model under class imbalance, with frequent emotions contributing proportionally more to the final score. The three scores are given in Equations (11), (12), and (13), respectively.

$\text{Micro F1} = \frac{2 \cdot \sum_{j = 1}^{L} {TP}_{j}}{2 \cdot \sum_{j = 1}^{L} {TP}_{j} + \sum_{j = 1}^{L} {FP}_{j} + \sum_{j = 1}^{L} {FN}_{j}}$

(11)

$\text{Macro F1} = \frac{1}{L} \sum_{j = 1}^{L} \frac{2 \cdot {TP}_{j}}{2 \cdot {TP}_{j} + {FP}_{j} + {FN}_{j}}$

(12)

$\text{Weighted F1} = \frac{1}{\sum_{j = 1}^{L} | Y_{j} |} \sum_{j = 1}^{L} | Y_{j} | \cdot \frac{2 \cdot {TP}_{j}}{2 \cdot {TP}_{j} + {FP}_{j} + {FN}_{j}}$

(13)

where:

${TP}_{j}$ , ${FP}_{j}$ , and ${FN}_{j}$ are the true positives, false positives, and false negatives for label j, respectively,
$| Y_{j} |$ is the number of true instances for label j.

3. Experimentation and Results

3.1. GoEmotions Dataset

The GoEmotions dataset is a large and well-processed dataset designed by Google for emotion recognition research. GoEmotions, containing more than 58,000 sentences in English curated from Reddit, is one of the largest available datasets for emotion-classification tasks. The coverage includes 27 distinct emotion categories plus one neutral label, which makes this dataset uniquely positioned to explore the complexity and diversity of human emotional expressions. GoEmotions contains everything from short phrases to full-blown sentences, reflecting the often informal and varied language of online platforms.

The GoEmotions dataset was chosen for its diversity, multilabel format, and wide emotional range, which together help models train on varied, realistic examples. Covering 27 emotion categories, GoEmotions spans a broader spectrum than most datasets while still allowing a meaningful mapping onto Ekman’s six foundational emotions. Sourced from Reddit, it captures contemporary, informal language, making it particularly useful for real-world applications such as social media analysis. Other emotion datasets, such as ISEAR and EmoReact, cover fewer emotions or contain more controlled experimental data that suppresses the complexity found in natural language; being small and structured, they are less applicable to informal real-world text. The size and diversity of GoEmotions therefore make it well suited to training and testing more advanced models such as EmoBERTa-X, and it was used here to maximize the model’s potential for accurate multilabel emotion classification and real-world relevance.

3.1.1. Label Mapping

The original 27 emotion labels were mapped onto Ekman’s six basic emotions (happiness, sadness, anger, fear, surprise, and disgust), with an additional neutral category [42]. This decision was made because Ekman’s model is widely applied in emotion-classification research and offers a robust basis for comparison with other studies. Each of the 27 original labels was first analyzed semantically for similarities and then grouped into categories. Where a label spanned more than one basic emotion, careful judgment was used to assign it to the most relevant group. The mapping retained emotional richness while aligning the dataset with a simpler, well-recognized structure that serves the purposes of this study.

Labels such as anger and annoyance were grouped under the Anger category.
Sadness and related emotions like disappointment were mapped to the Sadness category.
Positive emotions like happiness, amusement, and excitement were categorized under Joy.

Labels that corresponded directly, such as fear and surprise, were kept intact. When the labels did not squarely apply to any of Ekman’s basic categories, such as gratitude or caring, a contextual assessment was made. If these labels were positive and uplifting, they were classified under Joy; otherwise, they fell into the neutral category, which represented a lack of strong or distinct emotion.

Table 1 shows the resulting mapping table, which consistently maps the 27 labels into Ekman’s six categories plus neutral. It serves as a guideline for transforming the dataset and helps maintain consistency throughout the transformation. The table was built according to the grouping rules described above.

After the mapping stage, each instance in GoEmotions, previously annotated with one or more of the 27 labels, was reclassified under Ekman’s categories, as shown in Table 2. Multilabel annotations were retained where appropriate so as not to lose the richness of emotions expressed in the text. For example, an instance labeled with both joy and surprise is mapped so that it reflects both categories simultaneously. Figure 4 highlights the varying frequencies of each emotion, with ‘Joy’ being the most represented and ‘Disgust’ being the least, illustrating the class imbalance within the dataset.
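The relabeling step can be sketched as a dictionary lookup applied to each instance’s label set. The mapping below is deliberately partial: it contains only the pairs named in the text, while the full mapping is the one in Table 1. The function name and the choice to raise an error on unmapped labels are assumptions.

```python
from typing import Iterable, Set

# Partial mapping: only pairs explicitly mentioned in the text (see Table 1 for the full one).
EKMAN_MAP = {
    "anger": "Anger",
    "annoyance": "Anger",
    "sadness": "Sadness",
    "disappointment": "Sadness",
    "joy": "Joy",
    "amusement": "Joy",
    "excitement": "Joy",
    "fear": "Fear",
    "surprise": "Surprise",
}


def to_ekman(fine_labels: Iterable[str]) -> Set[str]:
    """Map fine-grained labels to Ekman categories, keeping the multilabel structure.

    Labels missing from this partial table raise KeyError on purpose,
    so that no category is silently guessed.
    """
    return {EKMAN_MAP[label] for label in fine_labels}


mapped = to_ekman(["joy", "surprise"])     # both categories are kept
merged = to_ekman(["anger", "annoyance"])  # two fine labels collapse into one category
```

Using a set as the output means that fine-grained labels falling into the same Ekman category are merged automatically, while distinct categories survive side by side.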

3.1.2. GoEmotions Preprocessing

Several preprocessing techniques were applied to prepare GoEmotions for training EmoBERTa-X. Abbreviation expansion was necessary because the dataset contains informal speech and shorthand expressions common on Reddit. It relies on a hand-built dictionary that expands abbreviations such as “IDK” to “I don’t know” and “LOL” to “laughing out loud”, ensuring that the model receives standardized input, as shown in Table 3. Tokenization was performed with RoBERTa’s tokenizer to keep the data consistent with the model’s input requirements. These steps make the data suitable for EmoBERTa-X, which is designed for complex natural language processing tasks.
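Dictionary-based abbreviation expansion of this kind can be sketched with a single regular expression. Only the two dictionary entries quoted in the text are included; the whole-word, case-insensitive matching rule and the function name are assumptions, and the tokenization step is left out.

```python
import re

# Two entries taken from the text; the full hand-built dictionary is not reproduced here.
ABBREVIATIONS = {"idk": "I don't know", "lol": "laughing out loud"}

# Match any known abbreviation as a whole word, regardless of casing (an assumed rule).
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their expansions before tokenization."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)


expanded = expand_abbreviations("IDK why but that was great lol")
```

The word-boundary anchors keep the pattern from rewriting substrings of longer words, for example the "lol" inside "lollipop".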

3.2. Experimental Work

A series of experiments was conducted to evaluate the performance of the EmoBERTa-X model by progressively refining the model architecture and the preprocessing steps.

3.2.1. Handling Overlapping Emotions

Most multilabel emotion recognition suffers from label overlap, which is common because several closely related emotions, such as joy and pride or fear and sadness, often occur within one text. This makes it difficult to identify and distinguish subtle differences accurately. EmoBERTa-X addresses this by combining a multi-head attention mechanism, which increases the model’s attention to multiple emotional cues within the same text, with DES, which selects the most competent classifiers for resolving overlapping emotional contexts.

To analyze how the model handles overlapping emotions, a specific example from the GoEmotions dataset was evaluated:

“I can’t believe I finally did it! I’m so proud and happy”

Ground Truth: Joy, Pride
Model Prediction (without DES): Joy
Model Prediction (with DES): Joy, Pride

The findings are as follows. Without the DES framework, the model tends to default to the more common emotions (for example, favoring joy over pride) and struggles to capture small differences between co-existing emotions. With the DES framework, the model adapts more effectively to overlapping contexts by employing competence-based weighting, which enhances its performance.

3.2.2. Ablation Study

The experiments for EmoBERTa-X were carried out in a systematic series, each adding a new component or adjustment in order to evaluate its contribution. All experiments used the same training parameters: the same number of epochs with the same early stopping strategy, the AdamW optimizer with a learning rate of 1 × 10⁻⁵, and the same loss function. The experiments progressively built upon each other to refine the model’s architecture and training approach.

Experiment 1: Baseline RoBERTa
The goal of this first experiment was to establish the baseline performance by fine-tuning RoBERTa with no change in the layers of the model and the dataset. This provided a starting point in understanding the model’s capability to handle multilabel emotion classification.
The baseline model achieved an accuracy of 65.94%, a micro F1-score of 67.07%, a macro F1-score of 60.55%, and a Hamming loss of 0.0935. These metrics provided an initial view of model performance in terms of both general accuracy and the balance between precision and recall.
This was a reasonable baseline performance, but it hinted at some limitations with respect to properly modeling multilabel classification tasks, particularly when it comes to aiming for high recall on the less frequent emotion labels. This setup pointed to the need for further modifications to enhance the model’s handling of complicated and multilabel data and increase generalization across diverse emotional expressions.
Experiment 2: RoBERTa plus Attention
In this experiment, a multi-head attention layer with 8 attention heads was added to RoBERTa so that the model could attend more closely to the most relevant parts of the input and emphasize salient features of the text, with the aim of improving multilabel classification.
This modification raised the accuracy to 66.14% and the micro F1-score to 67.23%, with a macro F1-score of 60.37% and a Hamming loss of 0.0929.
These results show that attention brought a small improvement in accuracy, micro F1-score, and Hamming loss, consistent with the model attending more closely to key textual features, while the macro F1-score remained close to the baseline.
Experiment 3: DES using RoBERTa plus Attention
This experiment tested the integration of DES into EmoBERTa-X to choose the best subset of models dynamically during inference. The objective was to exploit the adaptability of DES so that decisions account for input variability.
After training four instances of the model, performance improved markedly: the accuracy rose to 73.79%, the micro F1-score to 75.05%, and the macro F1-score to 69.08%, while the Hamming loss fell to 0.0703.
This considerably improves the reliability of the model, as DES dynamically selects the models most relevant to any given input. This adaptiveness boosts overall performance, especially in complex multilabel tasks, and underlines the power of ensemble strategies.
Experiment 4: DES using RoBERTa plus Attention applying ASEM
This experiment embedded abbreviation expansion into the preprocessing step to normalize informal language and make it more interpretable for the model, helping it understand colloquial expressions present in the text data.
With abbreviation expansion added, the accuracy improved slightly to 73.85%, while the micro F1-score (75.04%) and the Hamming loss (0.0704) remained essentially unchanged.
This suggests that improved handling of abbreviations helped the model refine its understanding of informal language, improving classification particularly for instances that abbreviations would otherwise have made harder to classify.
Experiment 5: DES using RoBERTa plus Attention applying ASEM and Emoji Conversion
This experiment introduced emoji conversion into the preprocessing pipeline to represent the emotions carried by non-verbal symbols. The conversion was expected to add contextual depth to the texts, allowing the model to pick up subtler emotions that were missed earlier. However, converting emojis reduced performance slightly: the accuracy was 72.80%, the micro F1-score 74.25%, and the Hamming loss 0.0725.
The results showed that, in this setup, emojis add context only to a limited degree and did not improve the model. This could mean that poorly represented emoji data has little impact on classification accuracy.
Experiment 6: DES using RoBERTa plus Attention applying ASEM and TVNR
The goal of this experiment was to further refine preprocessing by adding contraction replacement, such as “can’t” to “cannot”. This was expected both to improve the model’s parsing of text with complex structures and to improve its interpretation of varied sentence forms.
It resulted in an accuracy of 73.73%, a micro F1-score of 75.01%, and a Hamming loss of 0.0704. Although contraction replacement did not produce a significant jump in performance, it maintained the model’s effectiveness on text where contractions were present. This showed that refining preprocessing can help the model deal with real-world text variation.
Experiment 7: Addition of WBS during Training
The last experiment tackled the class imbalance problem at training time with weighted balanced sampling. This ensured that the model received more balanced examples of all classes, with special attention given to the underrepresented classes, so that it could generalize better.
This configuration achieved the best performance of all experiments: an accuracy of 75.52%, a micro F1-score of 76.10%, a macro F1-score of 70.13%, and the lowest Hamming loss, 0.0679. Weighted balanced sampling considerably improved the model’s recall on low-frequency classes and its generalization, while performing on par for the remaining classes. This result establishes class balancing as an integral part of top-performing multilabel classification.

This series of experiments with the EmoBERTa-X model also serves as an ablation study by evaluating the effect of each model component and each preprocessing step. Refining the model in steps and modifying it isolates the contribution of each element to the overall performance, as shown in Table 4.
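To make the size of each contribution easier to read, the sketch below tabulates the accuracy gain of every experiment over the baseline, using only the accuracy, micro F1-score, and Hamming loss values reported in the experiment descriptions above. The dictionary layout and short experiment names are illustrative choices.

```python
# (accuracy %, micro F1 %, Hamming loss) as reported for Experiments 1 to 7.
RESULTS = {
    "1 Baseline RoBERTa": (65.94, 67.07, 0.0935),
    "2 + Attention": (66.14, 67.23, 0.0929),
    "3 + DES": (73.79, 75.05, 0.0703),
    "4 + ASEM": (73.85, 75.04, 0.0704),
    "5 + ASEM + Emoji": (72.80, 74.25, 0.0725),
    "6 + ASEM + TVNR": (73.73, 75.01, 0.0704),
    "7 + WBS": (75.52, 76.10, 0.0679),
}

baseline_acc = RESULTS["1 Baseline RoBERTa"][0]

# Accuracy gain of each configuration over the baseline, in percentage points.
gains = {name: round(acc - baseline_acc, 2) for name, (acc, _, _) in RESULTS.items()}

for name, gain in gains.items():
    print(f"Experiment {name}: {gain:+.2f} accuracy points over baseline")
```

Read this way, the step that adds DES accounts for most of the gain (7.85 points over the baseline), and the final configuration with WBS reaches 9.58 points.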

Furthermore, Figure 5 shows the improvement in F1-scores across the experiments. Among the preprocessing steps, abbreviation expansion (ASEM) contributes the most by normalizing informal text and providing clearer semantic meaning. Token variation reduction (TVNR) further stabilizes the model by reducing noise from contractions. Emoji conversion, in contrast, slightly lowered the scores in Experiment 5, which indicates that handling non-verbal cues in social media text needs further care. Overall, the results emphasize the importance of handling informal and ambiguous text before model training.

The challenges of emotion classification with this dataset include, but are not limited to, the informal nature of the language, the imbalance in the class distribution, the complexity introduced by multilabel classification, and the ambiguity of short text instances. EmoBERTa-X addresses these issues through a combination of the advanced techniques described in Section 2.

Class Imbalance: As Figure 4 shows, the class distribution is skewed, which may bias the model toward the more frequent labels and make it generalize poorly to the less frequent emotions. To counter this, WBS was used to sample the underrepresented emotions more often during training. This yields better generalization for both frequent and rare emotion labels, as shown by the results in Table 5, which compares recall and precision scores for frequent and rare emotion classes with and without WBS. The results show that WBS significantly improves recall for rare emotions, such as disgust (+5.8) and fear (+2.1), while also enhancing precision. For frequent emotions, such as joy and neutral, performance is maintained or slightly improved, with notable gains in precision (+1.5 for joy and +6.6 for neutral). This highlights the applicability of WBS to multilabel emotion-classification tasks, especially under severe class imbalance.
Short Texts and Ambiguity: Many posts in GoEmotions are short and often ambiguous, which makes the intended emotional content difficult to capture. With its multi-head attention mechanism, EmoBERTa-X can better disambiguate the meaning of short texts and extract their emotional context more precisely.
Multilabel Classification Complexity: An important challenge of the GoEmotions dataset is that emotion classification is multilabel, meaning that one instance can carry more than one emotion; this was handled by introducing DES.
Handling Informal Language: The dataset contains a lot of informal language, slang, abbreviations, and nonstandard grammar. Abbreviation expansion in the preprocessing stage of EmoBERTa-X allows the model to better interpret slang and informal expressions in this dataset.

To further verify the performance gains of the proposed EmoBERTa-X over the baseline model, a paired t-test was conducted. The t-test yielded a p-value of 0.0312, below the commonly used significance level of 0.05. Hence, the improvements were statistically significant and unlikely to have occurred by random chance.
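For readers unfamiliar with the test, the paired t statistic is the mean of the paired differences divided by their standard error. The sketch below computes it with the standard library only. The paper does not state which paired scores entered the test (for example per-fold or per-run scores), so the inputs are placeholders and are not results from the paper; converting the statistic to a p-value additionally requires the t distribution with n − 1 degrees of freedom, which is not shown.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for the paired differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Placeholder numbers only; they are NOT scores reported in the paper.
proposed = [4.0, 5.0, 7.0, 8.0]
baseline = [3.0, 4.0, 5.0, 6.0]
t_stat = paired_t_statistic(proposed, baseline)
```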

3.3. Comparison to the State-of-the-Art Models

Figure 6 provides a visual performance comparison between EmoBERTa-X and the state-of-the-art models based on accuracy, micro F1-score, and macro F1-score. Table 6 presents a performance comparison between the results obtained using EmoBERTa-X and those of state-of-the-art models from earlier works. The proposed multi-head attention mechanism and DES framework demonstrate significant improvements in accuracy, micro F1-score, and macro F1-score compared to competing approaches.

Specifically, EmoBERTa-X obtains an accuracy of 75.5%, a micro F1-score of 76.1%, and a macro F1-score of 70.1%, which outperforms the closest competitor UCCA-GAT [43], which achieved 71.2% in terms of accuracy and 75.4% micro F1-score. Notably, the macro F1-score of UCCA-GAT is significantly lower at 63.9%, indicating that EmoBERTa-X has a better capability to achieve higher accuracy for both frequent and rare emotion labels.

Compared with transformer-based models such as RoBERTa and Dim-RoBERTa [44], which achieve accuracies of 65.9% and 65.7%, respectively, EmoBERTa-X improves accuracy by approximately 9.6 percentage points. This highlights the impact of integrating DES with multi-head attention, allowing the model to recognize the overlapping emotional signals in short and informal text more precisely.

Finally, the competence-based classifier selection further enhances EmoBERTa-X’s ability to manage imbalanced classes, achieving a lower Hamming loss compared to baseline transformer models. This is particularly significant for handling the highly skewed distribution of emotions often found in social media texts.

EmoBERTa-X’s performance improvements demonstrate the effectiveness of integrating DES with transformer models, particularly in multilabel emotion classification. The combination of DES and the multi-head attention mechanism enables the model to handle overlapping emotional contexts effectively, as illustrated in Section 3.2.1. The results validate that the proposed approach provides significant advantages over existing methods, particularly in handling informal social media text.

However, challenges remain in scenarios where emotional expressions are ambiguous or subtle, such as distinguishing between closely related emotions like joy and contentment or identifying implied emotions. While the model shows improvements in overlapping contexts (e.g., capturing both joy and pride in the example in Section 3.2.1), it can struggle when contextual cues are weak or emotions are not explicitly stated.

Moreover, the computational complexity introduced by the DES process makes scalable real-time applications a very challenging task because classifier selection needs more processing time. Future work can focus on the optimization of selection strategies and the use of any additional information in the form of external knowledge sources, such as sentiment lexicons or common-sense knowledge bases, to make the model more robust for highly ambiguous emotional contexts.

Table 6. Performance comparison of EmoBERTa-X with state-of-the-art models on the GoEmotions dataset. The table reports the accuracy, micro F1-score, and macro F1-score of graph-based, transformer-based, and RoBERTa-based models alongside EmoBERTa-X.

Model	Accuracy (%)	Micro F1 (%)	Macro F1 (%)	Key Observation
UCCA-GAT [43]	71.2	75.4	63.9	Lower macro F1 highlights poor handling of rare emotions.
Dep-GAT [43]	68.7	74.7	61.1	Lacks robust contextual adaptation.
BERT [45]	-	-	64.0	Basic transformer model without optimizations.
RoBERTa [44]	65.9	69.1	61.8	Improved over BERT but struggles with multilabel tasks.
Dim-RoBERTa [44]	65.7	68.6	61.0	Employs dimensionality reduction, enhancing efficiency but still struggles with rare emotional categories and overlapping emotions.
Proposed EmoBERTa-X	75.5	76.1	70.1	Excels in handling informal text, balances performance across all emotion classes with dynamic ensemble selection.

Note: The bold value represents the result for EmoBERTa-X, highlighting its best performance in the comparison.

4. Conclusions and Future Work

In this work, we proposed EmoBERTa-X, a new model that significantly improves multilabel emotion classification by combining the deep contextualized understanding of RoBERTa with the flexibility of dynamic ensemble selection. Applied to the challenging GoEmotions dataset, EmoBERTa-X yielded the best results in terms of accuracy (75.52%), micro F1-score (76.10%), and macro F1-score (70.13%). It outperforms existing models and addresses key weaknesses of prior approaches in handling rare emotions. A paired t-test showed that the observed improvements were statistically significant, with a p-value of 0.0312, strengthening the evidence that the improvements are not due to random chance.

Future work could focus on expanding EmoBERTa-X to other languages and domains by leveraging multilingual datasets, cross-domain corpora, and techniques like cross-lingual transfer learning. Optimizing computational efficiency through model compression, quantization, and pruning could enable real-time applications on mobile devices. Additionally, domain-specific fine-tuning and semi-supervised learning methods could enhance robustness and adaptability, ensuring broader adoption across diverse contexts.

Author Contributions

Conceptualization, F.H.L., M.E. and S.N.S.; methodology, F.H.L.; software, F.H.L.; validation, F.H.L., M.E. and S.N.S.; formal analysis, F.H.L.; investigation, F.H.L.; resources, F.H.L.; data curation, F.H.L.; writing—original draft preparation, F.H.L.; writing—review and editing, F.H.L., M.E. and S.N.S.; visualization, F.H.L.; supervision, M.E. and S.N.S.; project administration, M.E. and S.N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

García-Hernández, R.A.; Luna-García, H.; Celaya-Padilla, J.M.; García-Hernández, A.; Reveles-Gómez, L.C.; Flores-Chaires, L.A.; Delgado-Contreras, J.R.; Rondon, D.; Villalba-Condori, K.O. A Systematic Literature Review of Modalities, Trends, and Limitations in Emotion Recognition, Affective Computing, and Sentiment Analysis. Appl. Sci. 2024, 14, 7165.
Hanna, R.; Rohm, A.; Crittenden, V.L. We’re all connected: The power of the social media ecosystem. Bus. Horizons 2011, 54, 265–273.
Tawfik, A.; Elkhodary, H.O.; Saleh, S.N. A Deep Learning-based Emotion Recognition System for Interactive E-Learning. In Proceedings of the 2022 32nd International Conference on Computer Theory and Applications (ICCTA), Alexandria, Egypt, 17–19 December 2022; pp. 38–43.
Brynielsson, J.; Johansson, F.; Jonsson, C.; Westling, A. Emotion classification of social media posts for estimating people’s reactions to communicated alert messages during crises. Secur. Inform. 2014, 3, 7.
Sharma, A.; Sharma, K.; Kumar, A. Real-time emotional health detection using fine-tuned transfer networks with multimodal fusion. Neural Comput. Appl. 2023, 35, 22935–22948.
Kusal, S.; Patil, S.; Kotecha, K.; Aluvalu, R.; Varadarajan, V. AI Based Emotion Detection for Textual Big Data: Techniques and Contribution. Big Data Cogn. Comput. 2021, 5, 43.
Mansoor, M.A.; Ansari, K.H. Early Detection of Mental Health Crises through Artifical-Intelligence-Powered Social Media Analysis: A Prospective Observational Study. J. Pers. Med. 2024, 14, 958.
Asghar, M.Z.; Khan, A.; Bibi, A.; Kundi, F.M.; Ahmad, H. Sentence-level emotion detection framework using rule-based classification. Cogn. Comput. 2017, 9, 868–894.
Berka, P. Sentiment analysis using rule-based and case-based reasoning. J. Intell. Inf. Syst. 2020, 55, 51–66.
Wang, L.; Isomura, S.; Ptaszynski, M.; Dybala, P.; Urabe, Y.; Rzepka, R.; Masui, F. The Limits of Words: Expanding a Word-Based Emotion Analysis System with Multiple Emotion Dictionaries and the Automatic Extraction of Emotive Expressions. Appl. Sci. 2024, 14, 4439.
Öhman, E. The validity of lexicon-based sentiment analysis in interdisciplinary research. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, Silchar, India, 16–19 December 2021; pp. 7–12.
Nandwani, P.; Verma, R. A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 2021, 11, 81.
Sujanaa, J.; Palanivel, S.; Balasubramanian, M. Emotion recognition using support vector machine and one-dimensional convolutional neural network. Multimed. Tools Appl. 2021, 80, 27171–27185.
Semary, N.A.; Ahmed, W.; Amin, K.; Pławiak, P.; Hammad, M. Enhancing machine learning-based sentiment analysis through feature extraction techniques. PLoS ONE 2024, 19, e0294968.
Sarsam, S.M.; Al-Samarraie, H.; Alzahrani, A.I.; Wright, B. Sarcasm detection using machine learning algorithms in Twitter: A systematic review. Int. J. Mark. Res. 2020, 62, 578–598.
Bouazizi, M.; Ohtsuki, T.O. A pattern-based approach for sarcasm detection on twitter. IEEE Access 2016, 4, 5477–5488.
Shiri, F.M.; Perumal, T.; Mustapha, N.; Mohamed, R. A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU. arXiv 2023, arXiv:2305.17473.
Iyer, A.; Das, S.S.; Teotia, R.; Maheshwari, S.; Sharma, R.R. CNN and LSTM based ensemble learning for human emotion recognition using EEG recordings. Multimed. Tools Appl. 2023, 82, 4883–4896.
Chen, M. Emotion analysis based on deep learning with application to research on development of Western culture. Front. Psychol. 2022, 13, 911686.
Bodapati, S.; Bandarupally, H.; Shaw, R.N.; Ghosh, A. Comparison and analysis of RNN-LSTMs and CNNs for social reviews classification. In Advances in Applications of Data-Driven Computing; Springer: Singapore, 2021; pp. 49–59.
Liu, N.; Ren, F. Emotion classification using a CNN_LSTM-based model for smooth emotional synchronization of the humanoid robot REN-XIN. PLoS ONE 2019, 14, e0215216.
Acheampong, F.A.; Nunoo-Mensah, H.; Chen, W. Transformer models for text-based emotion detection: A review of BERT-based approaches. Artif. Intell. Rev. 2021, 54, 5789–5829.
Rezapour, M. Emotion Detection with Transformers: A Comparative Study. arXiv 2024, arXiv:2403.15454.
Ganie, A.G. Presence of informal language, such as emoticons, hashtags, and slang, impact the performance of sentiment analysis models on social media text? arXiv 2023, arXiv:2301.12303.
Aliyu, Y.; Sarlan, A.; Danyaro, K.U.; Rahman, A.S. Comparative Analysis of Transformer Models for Sentiment Analysis in Low-Resource Languages. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 353.
Ramaswamy, S.L.; Chinnappan, J. RecogNet-LSTM+ CNN: A hybrid network with attention mechanism for aspect categorization and sentiment classification. J. Intell. Inf. Syst. 2022, 58, 379–404.
Ramirez-Alcocer, U.M.; Tello-Leal, E.; Hernandez-Resendiz, J.D.; Romero, G. A Hybrid CNN-LSTM Approach for Sentiment Analysis. In Proceedings of the Congress on Intelligent Systems, Bengaluru, India, 4–5 September 2023; Springer: Singapore, 2023; pp. 425–437.
Saleh, S.N. Enhancing multilabel classification for unbalanced COVID-19 vaccination hesitancy tweets using ensemble learning. Comput. Biol. Med. 2025, 184, 109437. [Google Scholar] [CrossRef] [PubMed]
Yang, Y.; Wang, G.; Kong, H. Emotion Recognition Based on Dynamic Ensemble Feature Selection. In Man-Machine Interactions; Springer: Berlin/Heidelberg, Germany, 2009; pp. 217–225. [Google Scholar]
Costa, J.; Silva, C.; Antunes, M.; Ribeiro, B. Boosting dynamic ensemble’s performance in twitter. Neural Comput. Appl. 2020, 32, 10655–10667. [Google Scholar] [CrossRef]
Pan, B.; Hirota, K.; Jia, Z.; Zhao, L.; Jin, X.; Dai, Y. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 1903–1917. [Google Scholar] [CrossRef]
Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the real world: A survey on nlp applications. Information 2023, 14, 242. [Google Scholar] [CrossRef]
Chen, X.; Yin, Y.; Feng, T. Multi-Label Text Classification Based on BERT and Label Attention Mechanism. In Proceedings of the 2023 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2023; pp. 386–390. [Google Scholar]
Yuan, L.; Xu, X.; Sun, P.; Yu, H.P.; Wei, Y.Z.; Zhou, J.J. Research of multi-label text classification based on label attention and correlation networks. PLoS ONE 2024, 19, e0311305. [Google Scholar] [CrossRef]
Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012. [Google Scholar]
Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev. 2010, 33, 1–39. [Google Scholar] [CrossRef]
Lowe, S. RoBERTa-Base Model on GoEmotions. 2022. Available online: https://huggingface.co/SamLowe/roberta-base-go_emotions (accessed on 4 November 2024).
Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; NeurIPS Foundation; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
McNamara, D.S.; Graesser, A.C.; McCarthy, P.M.; Cai, Z. Automated Evaluation of Text and Discourse with Coh-Metrix; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
Ekman, P. Are there basic emotions? Psychol. Rev. 1992, 99, 550–553. [Google Scholar] [CrossRef] [PubMed]
Ameer, I.; Bölücü, N.; Sidorov, G.; Can, B. Emotion classification in texts over graph neural networks: Semantic representation is better than syntactic. IEEE Access 2023, 11, 56921–56934. [Google Scholar] [CrossRef]
Ameer, I.; Bölücü, N.; Siddiqui, M.H.F.; Can, B.; Sidorov, G.; Gelbukh, A. Multi-label emotion classification in texts using transfer learning. Expert Syst. Appl. 2023, 213, 118534. [Google Scholar] [CrossRef]
Demszky, D.; Movshovitz-Attias, D.; Ko, J.; Cowen, A.; Nemade, G.; Ravi, S. GoEmotions: A dataset of fine-grained emotions. arXiv 2020, arXiv:2005.00547. [Google Scholar]

Figure 1. EmoBERTa-X model architecture: This diagram illustrates the sequential workflow of the EmoBERTa-X model, beginning with data loading and preprocessing, followed by model training, dynamic ensemble selection, and concluding with model evaluation.

Figure 2. EmoBERTa-X model with integrated multi-head attention mechanism: The model consists of sequential layers, starting with the embeddings and an encoder, followed by the multi-head attention module. The attention output is average-pooled and passed through a dense layer with dropout, then through the final classification layers to the output layer for multilabel emotion classification. SDP denotes Scaled Dot-Product.

Figure 3. EmoBERTa-X training and dynamic ensemble selection process: Several instances of EmoBERTa-X are trained, each computing a competence score; the DES framework then selects the top-performing EmoBERTa-X based on the competence scores, pools its predictions, and moves on to model evaluation.

Figure 4. Distribution of emotions to be classified by EmoBERTa-X across different categories.

Figure 5. Trend of micro and macro F1-scores across experiments: This line chart shows the progress of the micro and macro F1-scores of the EmoBERTa-X model across different sets of experiments.

Figure 6. Performance comparison of EmoBERTa-X with the state-of-the-art models: This figure compares the accuracy, micro F1-score, and macro F1-score of EmoBERTa-X with those of existing graph-based, transformer-based, and hybrid approaches.

Table 1. Ekman mapping table. This table shows how the 27 original emotion categories in the GoEmotions dataset were mapped to Ekman’s six basic emotions.

Ekman Mapping	Emotion(s)
Anger	Anger, Annoyance, Disapproval
Disgust	Disgust
Fear	Fear, Nervousness
Joy	Joy, Amusement, Approval, Excitement, Gratitude, Love, Optimism, Relief, Pride, Admiration, Desire, Caring
Sadness	Sadness, Disappointment, Embarrassment, Grief, Remorse
Surprise	Surprise, Realization, Confusion, Curiosity
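
The mapping in Table 1 can be expressed as a simple lookup. The following is a minimal sketch, not the authors' implementation: the names `EKMAN_MAPPING`, `FINE_TO_EKMAN`, and `to_ekman` are hypothetical, and passing unmapped labels (such as a neutral label) through unchanged is an assumption, since Table 1 does not list them.

```python
# Hypothetical lookup built from Table 1 (Ekman category -> fine-grained labels).
EKMAN_MAPPING = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["joy", "amusement", "approval", "excitement", "gratitude", "love",
            "optimism", "relief", "pride", "admiration", "desire", "caring"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
}

# Invert the table: fine-grained label -> Ekman category.
FINE_TO_EKMAN = {
    fine: ekman for ekman, fines in EKMAN_MAPPING.items() for fine in fines
}


def to_ekman(labels):
    """Map fine-grained labels to a sorted, de-duplicated list of Ekman categories.

    Assumption: labels absent from Table 1 are passed through unchanged.
    """
    lowered = (label.lower() for label in labels)
    return sorted({FINE_TO_EKMAN.get(label, label) for label in lowered})


if __name__ == "__main__":
    print(to_ekman(["Annoyance", "Curiosity", "Disapproval"]))
    print(to_ekman(["Disappointment", "Embarrassment"]))
```

Because several fine-grained labels can share one Ekman category, the set comprehension collapses duplicates, which is why a text tagged with both Disappointment and Embarrassment ends up with the single label Sadness.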

Table 2. Examples of multilabel classification from the GoEmotions Ekman version: These examples from GoEmotions illustrate how sentences are mapped to Ekman’s six basic emotion categories. Each row shows a sample text with its multilabel emotion classification and the mapped Ekman emotion categories.

Text	Emotion	Ekman Emotion
This is so bad that I immediately retold it to everyone I know.	Disappointment, Embarrassment	Sadness
I didn’t read that but so what?	Annoyance, Curiosity, Disapproval	Anger, Surprise
Happy to be able to help.	Joy	Joy

Table 3. Data before and after preprocessing: This table presents examples of sentences from the GoEmotions dataset before and after applying ASEM. The changes are the expanded forms of abbreviations and slang expressions, such as “btw” to “by the way”, “lol” to “laugh out loud”, and “omg” to “oh my god”.

Before Preprocessing	After Preprocessing
Nice job building yourself btw	Nice job building yourself by the way
Lol it’s a bit of both I think.	laugh out loud it’s a bit of both I think.
omg, poor little bean	oh my god, poor little bean
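
The kind of expansion shown in Table 3 can be sketched with a dictionary lookup and a word-boundary regular expression. This is only an illustration under stated assumptions: the `SLANG` dictionary holds just the three entries named in the table, and `expand_slang` is a hypothetical helper, not the ASEM module itself.

```python
import re

# Hypothetical slang dictionary with only the three entries named in Table 3.
SLANG = {
    "btw": "by the way",
    "lol": "laugh out loud",
    "omg": "oh my god",
}

# Match whole words only, case-insensitively, so "Lol" matches but "lollipop" does not.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SLANG) + r")\b",
    re.IGNORECASE,
)


def expand_slang(text):
    """Replace each known abbreviation with its lowercase expanded form."""
    return _PATTERN.sub(lambda match: SLANG[match.group(0).lower()], text)


if __name__ == "__main__":
    for sample in ("Nice job building yourself btw",
                   "Lol it's a bit of both I think.",
                   "omg, poor little bean"):
        print(expand_slang(sample))
```

The word-boundary anchors keep the replacement from firing inside longer tokens, and the expansion is emitted in lowercase regardless of the input casing, matching the second row of Table 3.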

Table 4. Performance comparison of the full model and its variations without specific components. This table highlights the effect of removing each component on different evaluation metrics.

Experiment ID	Experiment Description	Accuracy (%)	Micro F1 (%)	Macro F1 (%)	Hamming Loss
1	Baseline Model, without preprocessing	65.94	67.07	60.55	0.0935
2	Adding Multi-head Attention	66.14	67.23	60.37	0.0929
3	Applying DES	73.79	75.05	69.08	0.0703
4	Addition of ASEM	73.85	75.04	68.82	0.0704
5	With Emoji Conversion	72.80	74.25	68.39	0.0725
6	Adding TVNR	73.73	75.01	68.96	0.0704
7	Proposed Model (EmoBERTa-X)	75.52	76.10	70.13	0.0679

Note: The proposed model (Experiment 7) achieves the best result in each metric.
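
For readers less familiar with the metrics in Table 4, the sketch below shows one common way to compute them for binary multilabel indicator matrices. It is an assumption-laden illustration: the function names are hypothetical, and exact-match (subset) accuracy is used here as one standard multilabel definition, which may differ from the accuracy definition behind the table.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) over all labels."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        counts.append((tp, fp, fn))
    # Micro: pool the counts across labels, then compute one F1.
    micro = _f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(_f1(*c) for c in counts) / n_labels
    return micro, macro


if __name__ == "__main__":
    y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
    print(hamming_loss(y_true, y_pred), subset_accuracy(y_true, y_pred))
    print(f1_scores(y_true, y_pred))
```

Micro F1 weights every label decision equally, so frequent emotions dominate it, while macro F1 gives each emotion the same weight. That difference is one plausible reason the macro scores in Table 4 sit below the micro scores on an imbalanced dataset.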

Table 5. Effect of WBS on the performance of emotion classification.

Emotion Category	Emotion	Without WBS (P/R)	With WBS (P/R)	Improvement (P/R)
Frequent	Joy	88.2/82.4	89.7/83.9	+1.5/+1.5
Frequent	Neutral	57.5/76.8	64.1/79.6	+6.6/+2.8
Frequent	Sadness	68.3/70.0	71.8/71.1	+3.5/+1.1
Rare	Anger	61.5/64.8	66.1/68.9	+4.6/+4.1
Rare	Disgust	51.3/56.7	58.5/62.5	+7.2/+5.8
Rare	Fear	75.9/63.2	81.0/65.3	+5.1/+2.1
Rare	Surprise	72.3/57.0	77.0/59.3	+4.7/+2.3


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

