This section is divided into three parts. In the first part, the authors provide an overview of the data and its key insights and describe what was prepared for further modeling, focusing mainly on the dataset labels, i.e., the y part of the data. In the second part, they describe the approach for dividing the dataset into train, validation, and test subsets. In the third part, the authors review the investigation of the texts, the X part of the data, and provide a text standardization layer to integrate into the model. The result of this section is a comprehensive set of tools for building and training ML models on the GoEmotions dataset.
4.1. Dataset Investigation
First of all, before any modeling, it is necessary to research what is being modeled. The authors chose the GoEmotions dataset, which contains a large collection of diverse text data covering a wide range of topics and emotions. This makes it well suited for training and testing emotion analysis models, as it provides a rich and varied source of data for researchers to work with [7]. The dataset contains text comments extracted from the Reddit platform [8]. Reddit is a social media platform for discussion, where people share their interests and run discussions on different topics, for example, mass culture, computer games, movies, gadgets, relationships, etc.
The GoEmotions dataset contains comments from 2005 (the start of Reddit) to January 2019, selected from subreddits with at least 10,000 comments [7]. The dataset includes only English comments that make wide use of chat-style language. It is important to mention that the dataset is de-identified and contains [NAME] and [RELIGION] placeholders to reduce bias.
There are 54,263 comments included in the dataset, divided by its authors into three parts for modeling: training (43,410 elements), validation (5426 elements), and testing (5427 elements). For the present review, all three parts are combined into a single dataset. The average length of a text sequence is 68 characters, and the average number of words is 13.
The authors operate with emotions provided by the GoEmotions dataset. According to its paper, all comments were manually labeled with 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and a neutral class.
The authors of the GoEmotions paper aim to “provide the greatest coverage in terms of kinds of emotional expression” [7]. As the list of emotions present in the dataset shows, it largely intersects with Robert Plutchik’s emotion wheel, although the emotion categories “anticipation” and “trust” are absent. The correlation of emotions is out of the scope of this study; however, the analysis of metrics reached on the testing dataset showed that the model sometimes confuses semantically close classes (Section 6.2).
The 27 emotion categories are divided into three sentiments: positive, negative, and ambiguous. While an emotion is the psychological response of the writer of a text comment, sentiment reflects the resultant feeling caused by that emotion. From the data standpoint, a sentiment is a top-level group of emotions.
To summarize, this dataset provides a comprehensive list of emotions, which makes it appropriate for emotion classification.
Some of the texts are labeled with more than one category. The distribution of category counts per text is presented in Figure 1.
As we can see, text comments are usually labeled with only one emotion. However, texts labeled with two categories also account for a significant share. Therefore, we need to solve not just a multiclass classification problem but a multilabel one, meaning that the model should be able to predict more than one emotion category. The alternative approach is to exclude texts with two or more labels or to select only one category for each of them; the model would then solve a multiclass classification problem by predicting a single correct class.
For the sake of clarity, we need to explain the difference between the multiclass and multilabel classification problems. If a classification model determines which one of many groups an element belongs to, the model solves a multiclass classification problem; the prediction is the single class that fits the element best. However, if an element can belong to several classes at once, the model solves a multilabel classification problem; in this case, the resulting prediction is a list of zero, one, two, or more classes that fit the item.
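For illustration, the difference can be sketched with hypothetical per-class scores; the 0.5 decision threshold below is purely illustrative and is not a parameter of this study:

```python
import numpy as np

# Hypothetical per-class scores produced by a model for one text
scores = np.array([0.7, 0.6, 0.1])

# Multiclass: exactly one class is chosen, the best-fitting one
multiclass_pred = int(np.argmax(scores))         # -> 0

# Multilabel: each class is decided independently against a threshold
multilabel_pred = np.flatnonzero(scores >= 0.5)  # -> array([0, 1])
```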
It is common to express different emotions in a single message, for example, surprise and confusion: “Wow! I didn’t expect this to happen. What should we do next?”. This means that the second approach, which requires excluding texts with two or more emotion categories, would reduce model accuracy and usability. Consequently, the authors decided to follow the first approach, making a model that provides independent predictions of each emotion category and, thus, can predict more than one emotion for a text piece.
A common phenomenon in emotion analysis is emotion mixing. Mixed emotions are described as an expression of two or more emotions, usually opposite ones. As mentioned before, a significant part of the texts in the dataset is labeled with two or more emotions, and texts can be labeled with close emotions. Consequently, to verify whether emotion mixing is present in the dataset, we can select items labeled with emotions related to different sentiments. This selection yields 4200 texts. For example: “That was my first live Lions game (British and flights to DTW are expensive) I’ve never gone through so emotions in such a short amount of time” (excitement, realization, surprise), “How did they respond? You can’t leave out the best part” (curiosity, disapproval).
It is also worth checking whether the GoEmotions dataset contains texts labeled with opposite emotions according to their sentiments. This selection includes 981 elements. For instance: “I can watch test cricket, this would be no problems. Honestly though I’m annoyed I missed it, hopefully there’s a re run” (annoyance, desire), “If he isn’t respecting your clear boundaries then you should block him and move on with your life. He’s not worth your time” (caring, disapproval).
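For illustration, such a selection can be sketched as follows. The DataFrame layout and the shortened emotion-to-sentiment mapping are assumptions made for the example, not the exact implementation used in this study:

```python
import pandas as pd

# Excerpt of the emotion-to-sentiment grouping from the GoEmotions taxonomy
SENTIMENT = {"excitement": "positive", "desire": "positive", "caring": "positive",
             "annoyance": "negative", "disapproval": "negative",
             "surprise": "ambiguous"}

# Stand-in for the combined dataset: one list of label names per text
df = pd.DataFrame({"text": ["example one", "example two"],
                   "labels": [["annoyance", "desire"], ["excitement"]]})

def spans_sentiments(labels):
    # True when the labels of a single text map to more than one sentiment group
    return len({SENTIMENT[l] for l in labels if l in SENTIMENT}) > 1

mixed = df[df["labels"].apply(spans_sentiments)]  # keeps "example one" only
```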
Having such data supports the authors’ decision to train the model to give independent predictions of each emotion category. However, the topic of emotion mixing is not further investigated in this study.
Taking into account that predictions are independent, the neutral class is removed from the dataset: a text lacks any emotional characteristics and belongs to the neutral class precisely when no emotions are predicted.
Figure 2 shows the class distribution in the dataset. As we can see, the dataset is imbalanced: for example, the “admiration” class consists of 5122 elements and the “approval” class has 3687, while the “grief” category contains only 96. As mentioned, the neutral class is to be removed and thus can be ignored in the current analysis.
Considering that we are solving a multiclass, multilabel classification problem, the authors need to encode labels accordingly. Consequently, the authors applied the strategy of “problem transformation, whereby a multilabel problem is transformed into one or more single-label (i.e., binary or multiclass) problems” [18] and encoded labels using a multi-hot approach. Hence, 27 target variables were created, each representing a binary value: whether or not the text is labeled with the corresponding emotion category.
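For illustration, a minimal sketch of multi-hot encoding using scikit-learn’s MultiLabelBinarizer is shown below; the shortened emotion list and the sample labels are purely illustrative, and the actual encoding code of this study may differ:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Excerpt of the 27 category names; the full list is given above
EMOTIONS = ["admiration", "anger", "joy", "surprise"]

mlb = MultiLabelBinarizer(classes=EMOTIONS)
# Each text carries a list of label names; multi-hot encoding turns it
# into one binary target variable per emotion category
y = mlb.fit_transform([["joy", "surprise"], ["anger"]])
# y == [[0, 0, 1, 1],
#       [0, 1, 0, 0]]
```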
4.2. Train, Validation, Test Split
The process of training a neural network requires three parts of data. The first and the biggest one is the training set. This is the data the model sees: the model updates its parameters by computing a loss function and backpropagating the error. The process is repeated several times, and each pass over the data is called an epoch. After each epoch, the model is evaluated by making predictions on the validation dataset and computing metrics; this is necessary for tracking the training process. Finally, when training is completed, the model makes predictions on the test dataset, and these predictions are evaluated using metrics.
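For illustration, the sketch below shows how the three parts participate in this protocol using Keras; the tiny model and random data are stand-ins and do not represent the architecture or data of this study:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the three splits (shapes are illustrative)
x_train, y_train = np.random.rand(64, 8), np.random.randint(0, 2, (64, 3))
x_val, y_val = np.random.rand(16, 8), np.random.randint(0, 2, (16, 3))
x_test, y_test = np.random.rand(16, 8), np.random.randint(0, 2, (16, 3))

model = tf.keras.Sequential([tf.keras.layers.Dense(3, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One fit call covers the train/validation loop; each pass is an epoch,
# and validation metrics are computed after every epoch
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3)
# Final evaluation on the held-out test part
model.evaluate(x_test, y_test)
```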
The dataset is already split into these three parts. However, the authors wanted to vary the fractions of these parts during the experiments; therefore, they decided to combine all three parts and implement their own split. Since the dataset is imbalanced, the authors faced the issue that small classes could be divided unequally. Thus, they decided to divide each category separately into train, validation, and test subsets.
In addition to this, the authors wanted to create a confusion matrix plot on the test dataset for all classes using the all-vs-all strategy. This can be achieved only by comparing one predicted label with exactly one true label. Consequently, the authors implemented a parameter for selecting only single-labeled texts for the test part.
During modeling experiments, it was noticed that models could not predict low-frequency classes, such as grief, relief, and nervousness. As a result, the authors added support for oversampling the training data up to a supplied threshold. For example, if the threshold value is 500, elements of classes with fewer than 500 items are randomly repeated until 500 items are reached. Finally, the authors added printing of the random seed used for the train, validation, and test split, together with support for re-using it; as a result, the datasets can be re-created. The train part is also randomly reshuffled after creation.
All code implementing this functionality can be found in the utils.make_dataframes function.
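For illustration only, a heavily simplified sketch of the oversampling step is given below; the actual logic lives in utils.make_dataframes, and the multi-hot column layout and shortened EMOTIONS list here are assumptions:

```python
import pandas as pd

EMOTIONS = ["grief", "relief", "nervousness"]  # excerpt; 27 names in total

def oversample(train: pd.DataFrame, threshold: int, seed: int) -> pd.DataFrame:
    # Repeat random rows of under-represented classes until `threshold` is reached
    parts = [train]
    for emotion in EMOTIONS:
        subset = train[train[emotion] == 1]
        missing = threshold - len(subset)
        if 0 < missing < threshold:  # skip empty and already large classes
            parts.append(subset.sample(n=missing, replace=True, random_state=seed))
    # Reshuffle the train part after creation
    return pd.concat(parts).sample(frac=1.0, random_state=seed)
```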
4.3. Text Standardization
Training a high-quality model requires operating on clean data. The better the data are pre-processed and standardized, the more accurate the model predictions will be, especially considering that this study follows the transfer learning approach using the pre-trained BERT model. The authors of BERT used the BooksCorpus and English Wikipedia datasets for pre-training [6]; thus, it is important to bring text messages to a standard language form.
Text pre-processing is widely addressed in natural language processing and text classification studies, and each problem and data context requires an individual approach. A relevant example of data pre-processing in the healthcare domain is presented in the study “An Analysis on Large Language Models in Healthcare: A Case Study of BioBERT” [19]. Its authors describe general data cleansing, as well as standardizing medical terms and applying custom tokenization “to accommodate the unique vocabulary and structure of biomedical and clinical texts… specialized tokenizers may be needed to handle medical terminology, abbreviations, and symbols” [19].
A similar problem is faced in the current research and is presented later: the wide use of chat-style language with slang, abbreviations, variant word spellings, etc. This lexis appears as different tokens to a machine learning model. Hence, the model should either be initially trained on such data or be adapted to processing it. The first approach is not applicable, since this study employs the transfer learning method, while the second one can be achieved by implementing custom tokenizers or additional pre-processing of the data.
Another relevant study is “CARER: Contextualized Affect Representations for Emotion Recognition” [20], in which the authors perform emotion recognition on a dataset extracted from Twitter. Despite differences in audiences and texting styles, both Twitter and Reddit are Internet discussion platforms and present similar language. The authors describe an identical problem of slang and coded words having the same meaning, such as “tnx” and “thanks”, or “waaaaking me” instead of “waking me”. That research solves it using “graph-based pattern representations” to extract emotion-relevant information.
This sub-section continues with the authors’ research of the language used in the GoEmotions dataset and a step-by-step implementation of the cleansing process. This text pre-processing can be similarly applied to other datasets sourced from Internet discussion platforms, such as Reddit, Twitter, etc. However, additional processing and analysis might also be required, especially if the texts lean towards a specific domain, as in the BioBERT study mentioned before.
To begin with, the authors apply the unidecode module [12] to the text to remove non-ASCII characters. The main advantage of this tool is that it not only removes problematic characters, such as unreadable spaces and emojis, but also replaces them with a correct interpretation where possible. For example, the “Latin small letter a with diaeresis” character (ä) becomes the “Latin small letter a”, while the “single comma quotation mark” character becomes a usual “single quotation mark”. This resolves encoding issues and removes different spellings of the same word.
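A minimal usage sketch of this step (the sample string is illustrative):

```python
from unidecode import unidecode

# Curly quotation marks and accented letters receive an ASCII interpretation
print(unidecode("they’re naïve"))  # -> "they're naive"
```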
The dataset is de-identified and contains placeholders instead of name or religion references. For example: “I’m not talking about [RELIGION] anymore though…”, “[NAME]… I’m sorry. This is just wrong. I, can’t.” These placeholders are removed from the text so that they do not affect the model, which makes the model less biased.
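This step can be expressed as a simple regular expression; the sketch below is one possible implementation, not necessarily the exact code of the study:

```python
import re

def remove_placeholders(text: str) -> str:
    # Strip the de-identification tokens used in GoEmotions;
    # surplus spaces are cleaned up by a later normalization step
    return re.sub(r"\[(NAME|RELIGION)\]", "", text)

remove_placeholders("[NAME]... I'm sorry.")  # -> "... I'm sorry."
```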
The dataset contains text commentaries written in English using an informal chat-style language. To give it a standard look, the authors replace English contractions [21] with their full spelling where it is unambiguous. For instance, “can’t” is simply replaced with “can not”. At the same time, “I’ll” is left as is, because it can mean either “I will” or “I shall”, and we cannot determine the correct form exactly; moreover, the form can affect the way the intention is expressed.
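A sketch of this replacement with an illustrative excerpt of the contraction dictionary (the full mapping follows [21]); it assumes curly apostrophes were already normalized by unidecode:

```python
import re

# Illustrative excerpt: only unambiguous contractions are expanded
CONTRACTIONS = {"can't": "can not", "won't": "will not", "didn't": "did not"}

def expand_contractions(text: str) -> str:
    for short, full in CONTRACTIONS.items():
        # Case-insensitive replacement loses the original casing,
        # which is acceptable because the text is lowercased later
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

expand_contractions("I can't believe it")  # -> "I can not believe it"
```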
It is common for the chat style to stretch vowels and some consonant letters, for example: “how’d you know? they’re soooo good”, “Loooool I didn’t know that it’s ridiculous”. Regardless of the text data encoding algorithm, the words “soooo” and “so” will have different encodings. To resolve this issue, any run of the letters ‘a’, ‘e’, ‘i’, ‘o’, ‘u’, ‘y’, ‘s’, ‘h’, ‘f’, ‘r’, or ‘m’ repeated more than three times is replaced with a single occurrence. The examples shown before become “how’d you know? they’re so good” and “Lol I didn’t know that it’s ridiculous”.
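This rule can be captured by a single regular expression with a backreference; the sketch below is one possible implementation:

```python
import re

# Any run of more than three identical letters from the list collapses to one
STRETCHED = re.compile(r"([aeiouyshfrm])\1{3,}", re.IGNORECASE)

def collapse_stretched(text: str) -> str:
    return STRETCHED.sub(r"\1", text)

collapse_stretched("they're soooo good")  # -> "they're so good"
collapse_stretched("Loooool")             # -> "Lol"
```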
The next step is to replace abbreviations and chat words with their full phrases and meanings [22]. For instance, “ASAP” is replaced with “as soon as possible”, while “L8R” becomes “later”, and so on. In addition, it is common in chat style to use “r” instead of “are”, “u” instead of “you”, and “@” instead of “at”; all of these are replaced with their full spelling.
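A sketch of this step with an illustrative excerpt of the dictionary (the full mapping follows [22]):

```python
import re

# Illustrative excerpt of the abbreviation dictionary
CHAT_WORDS = {"asap": "as soon as possible", "l8r": "later",
              "r": "are", "u": "you", "@": "at"}

def expand_chat_words(text: str) -> str:
    # Replace whole tokens only, so the "u" inside "but" stays untouched
    return re.sub(r"[\w@]+",
                  lambda m: CHAT_WORDS.get(m.group(0).lower(), m.group(0)),
                  text)

expand_chat_words("r u coming @ 5? L8R")  # -> "are you coming at 5? later"
```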
Furthermore, all words containing numbers are removed from the text, as are all punctuation characters except the comma, dot, hyphen, apostrophe, quotation, and exclamation marks. In addition, runs of repeated punctuation characters are collapsed into a single character. For example, “I love this!!! You got it!” becomes “I love this! You got it!”. Moreover, punctuation is brought to a proper form, without a space before and with exactly one space after.
Finally, leading and trailing spaces are trimmed, multiple spaces are replaced with only one, and text is converted to lowercase.
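These punctuation and whitespace rules can be combined into one function; the sketch below shows one possible ordering of the steps, not necessarily the exact implementation used in the study:

```python
import re

def normalize_text(text: str) -> str:
    text = re.sub(r"\w*\d\w*", "", text)             # drop words with numbers
    text = re.sub(r"[^A-Za-z,.'\"! -]", " ", text)   # keep allowed characters only
    text = re.sub(r"([,.!])\1+", r"\1", text)        # "!!!" -> "!"
    text = re.sub(r"\s*([,.!])\s*", r"\1 ", text)    # no space before, one after
    return re.sub(r"\s+", " ", text).strip().lower() # collapse spaces, lowercase

normalize_text("I love this!!!   You got it!")  # -> "i love this! you got it!"
```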
All of the processing above was packed into the TextStandardizeLayer class, which implements a TensorFlow [23] layer and can be integrated directly into the model before the encoder. This layer can also be used in other models working with chat language.
Although it comes as a ready-to-use solution, it is still necessary to investigate the data and set up text standardization according to the task. Additional functionality can be either added to the proposed layer or implemented as a separate layer before or after it.
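For illustration, a heavily simplified sketch of how such a layer can be implemented is given below; it is named StandardizeSketch to distinguish it from the authors’ full TextStandardizeLayer and covers only a few of the steps described above:

```python
import tensorflow as tf

class StandardizeSketch(tf.keras.layers.Layer):
    """Heavily simplified sketch of a text standardization layer."""

    def call(self, inputs):
        x = tf.strings.regex_replace(inputs, r"\[(NAME|RELIGION)\]", "")
        x = tf.strings.lower(x)
        # RE2, used by tf.strings.regex_replace, has no backreferences,
        # so each stretched letter gets its own pattern
        for ch in "aeiouyshfrm":
            x = tf.strings.regex_replace(x, ch + "{4,}", ch)
        x = tf.strings.regex_replace(x, r"\s+", " ")
        return tf.strings.strip(x)

# The layer can sit directly in front of the encoder:
model = tf.keras.Sequential([
    tf.keras.Input(shape=(), dtype=tf.string),
    StandardizeSketch(),
    # ... BERT preprocessing and encoder layers would follow here
])
```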