1. Introduction
The study of sentiments and emotions expressed in texts has been part of the Natural Language Processing field since the 1990s, with interest increasing over the last two decades due to the large amount of text available on the Internet [1], especially messages on social networks, where large numbers of people give their opinions on all relevant topics and express their personal emotions. Within this area, work has been done mainly on the classification of texts by their polarity or valence (positive or negative, generally also distinguishing neutral texts); this problem is usually called Sentiment Analysis. The detection of more complex emotions, on the other hand, is a problem of greater complexity, involving a finer classification of texts that takes into account different emotions, such as joy, sadness, anger, and surprise, among others. Emotion classification requires more expensive resources than those usually needed for polarity classification: more examples to cover the full set of classes, more annotators, and more attention to differentiate a varied number of classes and achieve reliable inter-annotator agreement.
This effort is worthwhile, as it could help in the detection of different problems that people experience and frequently express on social networks. Harassment, signs of mental health problems, hate speech, and many other situations can be detected by analyzing the affective content of texts. Furthermore, the study of emotions can be a tool for decision making in the political or business environment, where knowing the opinions and feelings of citizens or users can be very useful.
In order to face the problem of detecting emotions in texts, it is necessary to define the set of classes to be considered, and for this purpose work has been based on psychological studies. Several studies have worked with the three-dimensional scheme (valence, arousal, and dominance) presented by Wundt [2], carrying out different experiments that support this approach [3,4,5]. Other authors have defined sets of basic emotions, such as the set of six classes proposed by Ekman [6]: anger, disgust, fear, happiness, sadness, and surprise; or Plutchik's set of eight emotions [7]: anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. These theoretical proposals have led to the creation of different emotion lexicons (see Section 2).
Although having specific lexicons can help with emotion detection, the main resources for text classification are currently annotated datasets for training machine learning models. Large datasets can be used with deep neural network approaches, usually without the need to define attributes such as emotion word counts based on a lexicon. Creating these resources is costly and imprecise, since emotions must be interpreted, which places us in a clearly subjective field. Nonetheless, corpus annotation tasks have been carried out on the basis of different sets of emotions, using different types of texts and different annotation schemes, e.g., single-label or multi-label classification.
In this paper, we present a review of the resources available for emotion analysis, focusing on resources for the Spanish language, and of previous work on automatic emotion classification. We also describe a new dataset built by merging two existing emotion datasets for Spanish, and then present some experiments performed on the new dataset, taking as a starting point the systems we submitted to the EmoEvalEs task at IberLEF 2021 [8,9]. Finally, we analyze the most problematic classes.
The paper is organized as follows. In Section 2, we present the related work. In Section 3, we describe the materials (the new corpus) and the experiments on automatic detection of emotions. In Section 4, we show and analyze the results of the experiments. Finally, in Section 5, we present the conclusions of the work.
4. Results and Discussion
Table 3 shows the results on the development and test corpora from EmoEvalEs of the neural model trained with the EmoEvent corpus, on the one hand, and with the EmoEvent + SemEval corpus, on the other hand. The metrics we used for evaluation are Accuracy (Acc) and Weighted F1 (W-F1).
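For reference, both evaluation measures can be computed with scikit-learn; the labels below are hypothetical and serve only to illustrate how Accuracy and Weighted F1 behave on a small multi-class example:

```python
# Accuracy and weighted F1 as computed by scikit-learn.
# The label lists are invented for illustration; they are not the paper's data.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["joy", "anger", "others", "others", "sadness", "joy"]
y_pred = ["joy", "others", "others", "joy", "sadness", "joy"]

acc = accuracy_score(y_true, y_pred)
# Weighted F1: per-class F1 scores averaged, weighted by each class's true support,
# so frequent classes (here, "others") dominate the aggregate score.
w_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"Acc: {acc:.4f}, W-F1: {w_f1:.4f}")  # → Acc: 0.6667, W-F1: 0.6000
```

The weighting by support matters for this dataset because the class distribution is skewed toward the others category.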
Table 4 shows the results of our current best model, which uses data from EmoEvent and the Spanish set of SemEval data, compared to the best and worst systems in EmoEvalEs. We also include our own submission to EmoEvalEs for comparison.
The results seem to indicate a slight improvement when training with the extended training corpus over training only with the EmoEvent corpus, but, since the confidence intervals are not completely separate, further experiments would have to be performed to confirm this. This result is of particular interest since the new corpus contains tweets on different topics, not only tweets on some specific events, as is the case with EmoEvent. The combination of the two corpora could have had a negative effect on the results, compared with training on the original corpus alone, since the test corpus contains exclusively tweets related to the events selected for building the EmoEvent corpus.
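One common way to obtain such intervals (a sketch only; we do not claim this is the exact procedure used for the reported intervals) is a bootstrap over the test set, resampling gold/predicted pairs rather than retraining:

```python
# Sketch of a bootstrap confidence interval for accuracy on a fixed test set.
# It resamples (true, predicted) pairs with replacement; an assumption, not
# necessarily the procedure behind the intervals reported in the paper.
import random

def bootstrap_ci(y_true, y_pred, n_boot=10000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test indices
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

When the resulting intervals of two systems overlap, as here, the difference cannot be declared significant without further testing.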
Looking at the confusion matrix of the model trained with the EmoEvent + SemEval corpus, shown in Figure 1, we see that almost all classes tend to be confused with the class others. Due to the way this class was generated [22], it is expected that many of these tweets express some emotion, since tweets that received different emotion labels from different annotators were assigned to the others category. It is not a category representing tweets without emotion, but tweets with some emotion or a mixture of several, and probably also neutral tweets. Something similar happened with the neutral class of the dataset for sentiment analysis in the TASS task, where tweets with neutral polarity as well as tweets with mixed polarity, i.e., with both positive and negative nuances, could be found. In [38], we discuss this problem and show a detailed analysis of tweets belonging to that category.
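A confusion matrix like the one in Figure 1 can be obtained directly from the gold and predicted labels; the snippet below is a generic sketch with hypothetical labels, not the paper's actual predictions:

```python
# Building a confusion matrix with scikit-learn; labels are invented examples.
from sklearn.metrics import confusion_matrix

labels = ["anger", "joy", "others"]  # fixes the row/column order
y_true = ["joy", "anger", "others", "joy", "others"]
y_pred = ["others", "anger", "others", "joy", "joy"]

# cm[i][j] = number of examples of true class labels[i] predicted as labels[j];
# off-diagonal mass in the "others" column reflects the confusion discussed above.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Reading along a row shows where a gold class "leaks" to; reading down the others column shows how much of every class is absorbed by it.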
This can be seen graphically in Figure 2 as well. In this diagram, the crossing arcs represent cases in which the classifier is wrong, and the bumps inside categories represent cases in which the classifier is right. We can see in this diagram that the others category is the most numerous, and approximately a quarter of its expected values are classified as other categories. The joy class, on the other hand, has close to half of its examples classified as others. Note that the surprise, fear, and anger categories have almost no visible bump, as almost no example of these categories is correctly classified. Furthermore, we can see that the disgust class is confused with the anger and others categories in similar proportions.
An experiment performed with a new version of the corpus, excluding the others category, allows us to see how the remaining categories are better classified. In Table 5, we show the results for each category using both versions of the corpus: with and without the others class.
As can be seen, the overall measures improve significantly. On the test corpus, Accuracy rises 5.87 points and Weighted F1 rises 5.50 points, the increase being even greater in the fear (+22) and joy (+25) categories. However, the most problematic class (disgust) does not improve.
Some examples of others tweets from the training corpus show the diversity we can find in this class:
A sad tweet: Guardaré en mis ojos tu última mirada… ("I will keep your last gaze in my eyes…") #notredame #paris #francia #photography #streetphotography;
A clearly positive tweet, which could even have been annotated as joy: Que clase táctica están dando estos dos Equipos… bendita #ChampionsLeague ("What a tactical lesson these two teams are giving… blessed #ChampionsLeague");
An informative tweet, with no emotion: El escrutinio en el Senado va mucho más lento. Solo el 14.85% del voto escrutado ("The count in the Senate is going much more slowly. Only 14.85% of the vote counted") #28A #ElecccionesGenerales28A.
Besides the others class, we carried out some more experiments on the three most difficult classes to detect (disgust, fear, and surprise), to try to understand what makes these categories so difficult. One interesting experiment is to analyze which words are the most relevant for each class, i.e., the ones that would let us identify a tweet of one of these categories with the highest confidence. To do this, we trained several variants of SVM classifiers using different lists of BOW (bag-of-words) features selected with the ANOVA F-value method. These classifiers only discriminate one class against all the rest, for example, disgust vs. no-disgust, or fear vs. no-fear.