Review

A Survey on Datasets for Emotion Recognition from Vision: Limitations and In-the-Wild Applicability

by Willams Costa 1,*, Estefanía Talavera 2, Renato Oliveira 1, Lucas Figueiredo 1,3, João Marcelo Teixeira 1, João Paulo Lima 1,4 and Veronica Teichrieb 1

1 Voxar Labs, Centro de Informática, Universidade Federal de Pernambuco, Av. Jornalista Aníbal Fernandes s/n, Recife 50740-560, Brazil
2 Data Management and Biometrics Group, University of Twente, Drienerlolaan 5, 7522 NB Enschede, The Netherlands
3 Unidade Acadêmica de Belo Jardim, Universidade Federal Rural de Pernambuco, PE-166 100, Belo Jardim 55150-000, Brazil
4 Visual Computing Lab, Departamento de Computação, Universidade Federal Rural de Pernambuco, Pç. Farias Neves 2, Recife 52171-900, Brazil
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5697; https://doi.org/10.3390/app13095697
Submission received: 6 April 2023 / Revised: 29 April 2023 / Accepted: 2 May 2023 / Published: 5 May 2023

Abstract:
Emotion recognition is the task of identifying and understanding human emotions from data. In the field of computer vision, there is a growing interest due to the wide range of possible applications in smart cities, health, marketing, and surveillance, among others. To date, several datasets have been proposed to allow techniques to be trained, validated, and finally deployed to production. However, these techniques have several limitations related to the construction of these datasets. In this work, we survey the datasets currently employed in state-of-the-art emotion recognition, to list and discuss their applicability and limitations in real-world scenarios. We propose experiments on the data to extract essential insights related to the provided visual information in each dataset and discuss how they impact the training and validation of techniques. We also investigate the presence of nonverbal cues in the datasets and propose experiments regarding their representativeness, visibility, and data quality. Among other discussions, we show that EMOTIC has more diverse context representations than CAER, however, with conflicting annotations. Finally, we discuss application scenarios and how techniques to approach them could leverage these datasets, suggesting approaches based on findings from these datasets to help guide future research and deployment. With this work we expect to provide a roadmap for upcoming research and experimentation in emotion recognition under real-world conditions.

1. Introduction

The recognition of human behavior is a task that has been of increasing interest to our society in recent years due to its ability to enable a deeper understanding of users in the most diverse scenarios. By monitoring humans, machine intelligence systems can detect patterns in behavior and suggest changes that could improve interaction and quality of experience. One of the many sub-tasks involved in the recognition of behavior is called emotion recognition, which, from multiple definitions in both computer science and psychology research, can be defined as the process of identifying emotions based on nonverbal cues given by the person [1,2,3,4,5,6,7]. The nonverbal aspect is essential in many contexts where the user needs to interact with environments instead of systems, allowing for an unobtrusive experience and a perception of emotion that is not forced.
Especially in the context of smart environments, which can be extended to smart cities, understanding the user is crucial. Given the focus of smart cities on building more attractive and lively spaces, physical sensors and internet-of-things frameworks are typically deployed to measure various aspects of a region, such as weather, pollution, and traffic. However, such deployments still lack a focus on understanding how citizens feel near specific areas of the city and how changes could impact their perception of and feelings about those spaces [8].
There have been studies focused on understanding the impact of emotion recognition approaches on decision making in these scenarios, mainly to understand the influence of green spaces, such as public parks and forests, on the urban scene. For instance, the work by [9] studied preferences for ornamental plant species in green spaces by leveraging emotion perception. In addition, the work by [10] used selfies to extract emotion, revealing that visitors of parks near the city center are more likely to display negative emotions. Other works have focused on different scenarios, such as [11], which investigated the relationship between perceived emotions on a university campus and variables such as environment and habitation availability, and [12], which investigated how sound pollution can affect perceived emotion. Therefore, by focusing on human factors, smart cities can engage with citizens in a personalized way and obtain real-time, not modeled, feedback about their services or public spaces.
Along the nonverbal axis, there are many cues that humans can naturally perceive from different points of view. For example, involuntary movements (facial expressions, body language, and eye gaze), voluntary movements (touching and avoidance of touching by the peer), or conversational cues (changes in subject, lies, superficial talk, and variations in speech tonality). Although each sub-group of nonverbal communication has significant research works that propose approaches for understanding them, in this work, we focus on those which can be understood through vision-based approaches and in the wild, discarding, for instance, datasets for emotion recognition based on speech tonality, since in these uncontrolled scenarios people will not always be in conversation with peers and a high-quality recording of the voice would be very difficult to achieve.
The usage of other nonverbal cues is what currently separates the literature into two tasks: facial expression recognition (FER) and emotion recognition. The first task is motivated by the fact that facial expressions carry many discriminative features that we can correlate with emotion. Early approaches for FER focused on lab-controlled scenarios and posed expressions. Throughout the years, competitions such as Emotion Recognition in the Wild (EmotiW) [13] have been driving this task towards less restrictive approaches that could also be deployed in the wild. However, such techniques are still restricted to scenarios with a visible face and are harmed when the face is occluded or not directly aligned with the camera [14,15]. These restrictions motivated researchers to fuse facial expressions with other types of nonverbal cues; these approaches are what is known today as emotion recognition. Approaches such as the work by Kosti et al. [16], context-aware emotion recognition networks (CAER-Net) [17], and the global-local attention for emotion recognition network (GLAMOR-Net) [18] proposed extracting background features as a representation of context, while approaches such as emotion recognition using adaptive multi-cues (EmotiRAM) [19] and the work by Chen et al. [20] proposed, beyond context, to leverage body language. Therefore, combining multiple cues may lead to more robust techniques since, in case of the occlusion of a specific cue, the model could rely on other cues to make predictions.
As the evaluations on the benchmarks evolve, there is a tendency to deploy these systems in real-life scenarios to contribute to society and generate results outside the academic scenario, in which we usually validate datasets. However, it is well-known that deep learning systems, such as those employed for emotion recognition techniques, depend highly on data availability and quality. Although there are several datasets for this task, each dataset has its particularities and limitations, which could severely harm the capacity of execution in real scenarios. Furthermore, there are limitations regarding sample bias, in which samples present in the dataset are not representative of the real world, and recall bias, which is caused by the way multiple annotators are usually handled in these scenarios.
In this work, we survey datasets currently used for benchmarking techniques for emotion recognition. Specifically, we focus on vision-based datasets containing images or videos for evaluation. Therefore, we do not evaluate datasets focusing on speech tonality, for example. Some of the contributions of our work are:
  • We review how the behavioral psychology literature describes emotion perception in humans and correlate this with current datasets (Section 2);
  • We survey and list the datasets currently employed for benchmarking in the state-of-the-art techniques (Section 3);
  • We explore the difference in the annotations of these datasets and how they could impact training and evaluating techniques (Section 3.2);
  • We discuss annotations described using continuous models, such as the valence, arousal and dominance (VAD) model, and how they can harm the ability of a model to understand emotion (Section 3.3);
  • We investigate the presence of nonverbal cues beyond facial expression in the datasets of the state-of-the-art techniques and propose experiments regarding their representativeness, visibility, and data quality (Section 3.4);
  • We discuss possible application scenarios for emotion recognition extracted from published works and how each dataset can impact positively and negatively according to its features (Section 4).
This survey also differs from other published surveys in emotion recognition, such as [21,22,23,24,25], since we focus on listing, understanding, and discussing datasets and the permissions and limitations they bring to techniques, while surveys usually focus on listing techniques, their limitations, specifications, and results on benchmarks. To the best of the authors’ knowledge, this work is the first survey that focuses on datasets instead of techniques.

2. Perceived Emotion in Humans

Although emotion recognition is a multidisciplinary field of study, it currently depends highly on the advances made in cognitive psychology research. With the rise of this field in the 1950s, theories based on cognitive approaches gained significant appeal by connecting sentiment and behavior [26]. Current research also proposes a significant number of sub-fields that study specific behaviors which we can correlate with emotion, such as kinesics [27], haptics (or tactile) communication [28], and proxemics [29]. Undoubtedly, the advances in these fields will directly contribute to new, exciting approaches to emotion recognition using deep learning. Furthermore, given that emotion recognition is a cognitive task, we can develop models that try to imitate, at some level, the way that humans perceive emotion.
However, there are already many discoveries and discussions in the literature. Darwin, for instance, in his work The expression of the emotions in man and animals [30], studied how our emotions influence our body movements. This publication is not new but instead goes back one and a half centuries. Other significant researchers, such as Ekman [31] and Wallbott [4,32], have also contributed profoundly to the understanding of emotion perception. Therefore, we hypothesize that the limiting factor for the advance of this technology is not a lack of computational power or of state-of-the-art deep learning models, but rather the lack of quality data that researchers could use to train such models. Designing and generating a dataset is expensive, and this factor is even more impactful for more specific tasks, which is the case for emotion recognition and, in general, for fields that need to create descriptions for concepts that are natural and innate to humans.

2.1. The Portrayal of Emotion

The portrayal of emotion comes naturally to humans due to the connection between personality, mood, sentiment, and physical behavior. We communicate our emotions in two primary forms: verbal communication, when one uses words to communicate feelings, and nonverbal communication, in which we give cues of how we feel on different channels and at different intensities. The latter can constitute up to 50% of what we are communicating [3].
Darwin’s work [30] proposed a view regarding the role of body movements and facial expressions in nonverbal communication. Pride, for instance, makes a person exhibit a sense of superiority by erecting their head and body, making them appear as large as possible, as if they were puffed up with pride. In a later study, ref. [33] showed that sighted, blind, and even congenitally blind individuals from multiple cultures portrayed similar behaviors in response to success and failure situations at the Olympic or Paralympic Games. Therefore, this study shows that these expressions are not simply stereotypes mimicked between people; rather, this behavior may be biologically innate to humans. In Table 1, we present other examples of correlations between body movements and emotions as proposed by Darwin in his work.
Other works over the years have investigated different approaches related to body language. Ref. [4] proposed an experiment to understand which behavioral cues related to body movement and language people would use to judge emotions portrayed by six professional actors hired to play two scenarios. In this study, the judges relied on the number of hand movements, head orientation, and the quality of the movements, such as how energetic they were, to classify the emotion. Later, ref. [32] proposed another experiment to demonstrate that some movements and postures are, to some degree, specific to certain emotions and can be explained by the intensity of these actions. Their results agreed with Darwin’s evaluation, pointing out that, for example, people feeling ashamed or sad usually have a more collapsed body posture.
Among various other researchers, Ekman’s work describing facial expressions and their correlation with emotion has been widely accepted and employed at multiple levels in different works descending from affective computing. One category of expressions is called momentary expressions because the information they convey can be captured instantly. This means that one could take a snapshot at any point when the expression is at its apex, and this snapshot would easily convey that person’s emotion [31]. This is an argument for datasets based on images, for example. Another possible type of expression is one extended in time, in which a sequence of signals provides the necessary cues. Ref. [34] discusses that emotions such as embarrassment, amusement, and shame have a distinct nonverbal display on facial expressions and other cues related to the facial area. For example, embarrassment is marked by gaze aversion, shifty eyes, speech disturbances, face touches, and nervous smiles. This is an argument for datasets based on videos, for example.
However, given how emotions are intrinsically innate to each person, it is difficult to argue which emotion a specific person is feeling. This is why a discussion regarding perceived emotion is more accurate, meaning understanding how that person exposes their emotion, naturally or in a posed manner. Therefore, can there be emotions without facial expressions? As Ekman [31] discussed, individuals may not show any visible evidence of emotion in the face. Tassinary and Cacioppo [35] showed in their work that, by employing surface electromyography, they could record slight electrical variations in the face. This means that even when someone does not display emotions on their face, they may still do so in a visually unobservable manner.
People can also try to fabricate expressions when they do not feel any emotion, in an attempt to mislead the observer. Evidence suggests that emotions such as enjoyment, anger, fear, and sadness contain muscular actions that most people are unable to perform voluntarily, meaning that, although this is possible, it is difficult to do so in a convincing manner [31,36]. This can be expanded to other aspects of behavior that one may want to disguise. For example, if one deliberately hides their felt emotion by forcing expressions on their face, cues to the felt emotion could still be perceived from other sources, especially body language [37].

2.2. Association between Nonverbal Cues

Cues are presented simultaneously, and humans rely on multiple cues to decode emotion. Therefore, instead of designing an approach that focuses on a specific cue, looking at the entire scene may be more informative, including the context in which the person is placed. The context has a direct influence on emotion, and the amount of influence is usually divided by the psychology literature into three broad categories: personal features, such as demographics and traits; situational features, such as goals and social norms; and cultural features, such as interdependence and national wealth [38].
Therefore, situational features should be encoded in datasets, given that the depicted situation will impact someone’s emotions. The current placement of the person has a direct impact on their emotions: placements that could positively impact emotions include parties, public parks, and natural landscapes, while obscure, dark scenarios could indicate negative emotions. In addition, the presence of other people in the scene could be used as a cue for perceiving emotion. For example, in samples with group interaction, the individual emotions of the group members influence each other [39]. Studies show that perceivers automatically encode such contextual information to make a judgment [6], and that altering this background information could lead to an altered perception of the same facial expression [40].

3. Datasets used in State-of-the-Art Techniques

In this section, we will introduce the datasets currently employed as benchmarks for state-of-the-art techniques. Rather than providing an exhaustive overview of past datasets for emotion recognition, we focus on the recent efforts currently being employed and evaluated in the state-of-the-art techniques. We show samples from these datasets in Figure 1 and give an overview of these datasets in Table 2. As discussed previously, due to the importance of context and other nonverbal cues for perceiving emotion, we will focus our efforts on datasets that explore such features or have them available, discarding, for instance, datasets for facial expression recognition (FER) in this evaluation. Therefore, datasets such as AffectNet [41], the facial expression recognition 2013 (FER2013) dataset [42], and the extended Cohn–Kanade dataset (CK+) [43] will not be considered in this evaluation, neither will techniques that evaluate solely using these datasets, such as the technique proposed by Savchenko et al. [44], the work by Kollias and Zafeiriou [45], distract your attention (DAN) [46], exploiting emotional dependencies with graph convolutional networks for facial expression recognition (Emotion-GCN) [47], the model proposed by Ryumina et al. [48], the model proposed by Aouayeb et al. [49], frame attention networks (FAN) [50], and many others that place themselves as benchmarks for FER.
The Acted Facial Expressions in the Wild (AfeW) dataset [51] was proposed to tackle the limitation imposed by the lack of data from real-world scenarios. At the time of publication, datasets for this task were mainly composed of images recorded in laboratory scenarios with posed expressions. The authors argue that, ideally, a dataset for this task would be recorded with spontaneous expressions and in real-world environments, which continues to be difficult today. Therefore, they propose using scenes extracted from movies, which contain environments close to the real world. We show samples of this dataset in Figure 1a.
The authors extracted clips from 54 movies containing subtitles, which were later parsed for expression-related keywords that could assist the labeler. Two labelers annotated each sample with an expression. For each person present in the scene, the labels also contained information on the head pose, age of the character, gender, and expression of the person. A static subset, SfeW, was also generated from this dataset. They generated 1426 video clips with sequence lengths from 300 to 5400 ms and with 330 subjects (performers). Later, AfeW was extended to the AfeW-VA dataset [52], which labeled emotion using a continuous model. We will discuss these annotations in Section 3.3.
Table 2. Overview of the datasets commonly used in the state-of-the-art techniques. Each row contains a different dataset. VAD or VA are continuous annotation models. For the demographics, M and F stand for male and female, respectively, while A, T, and C stand for adult, teenager, and children.

| Dataset Name | Annotation Formats | Annotators | Samples | Demographics |
|---|---|---|---|---|
| EMOTIC [53] | Categorical—26 classes; Continuous—VAD model | AMT ² | 18,316 images | 23,788 people—66% M and 34% F; 78% A, 11% T, and 11% C |
| EMOTIC ¹ [16] | Categorical—26 classes; Continuous—VAD model | AMT ² | 23,571 images | 34,320 people—66% M and 34% F; 83% A, 7% T, and 10% C |
| CAER and CAER-S [17] | Categorical—6 classes + neutral | 6 annotators | 13,201 video clips; 70,000 images | Unknown |
| iMiGUE [54] | Categorical—2 classes | 5 (trained) annotators | 359 video clips | 72 adults—50% M and 50% F |
| AfeW [51] | Categorical—6 classes + neutral | 2 annotators | 1426 video clips | 330 subjects aged from 1 to 70 years |
| AfeW-VA [52] | Continuous—VA model | 2 (trained) annotators | 600 video clips | 240 subjects aged from 8 to 76 years; 52% female |
| BoLD [55] | Categorical—26 classes; Continuous—VAD model | AMT ² | 26,164 video clips | 48,037 instances—71% M and 29% F; 90.7% A, 6.9% T, and 2.5% C |
| GroupWalk [56] | Categorical—3 classes + neutral | 10 annotators | 45 video clips | 3544 people |

¹ A second version of EMOTIC was published in 2019. ² Amazon Mechanical Turk.
The Emotions in Context (EMOTIC) dataset [53] is a benchmark widely employed in the current literature for emotion recognition. The objective behind the construction of EMOTIC was to gather images and subjects in their natural, unconstrained environments, including contexts. The authors based their collection on well-established datasets such as Microsoft’s common objects in context (MSCOCO) dataset [57] and ADE20K [58], and expanded these samples with images captured using Google Images. The same authors later revisited the dataset in their 2019 work Context based emotion recognition using emotic dataset [16]. Please note that the latter version is discussed in this work. In addition, we show samples of this dataset in Figure 1b.
EMOTIC is a dataset widely used today in the literature due to its proximity to an in-the-wild setting. The authors used the Amazon Mechanical Turk service to recruit annotators and asked about the emotions they perceived from the people and context. In a first round of annotations, a single annotator annotates each image. After this, the test and validation sets receive up to two and four extra annotations, leading to three and five annotators for these sets, respectively. The 2017 version contains 18,316 different images containing 23,788 annotated subjects, of which 66% are males and 34% are females, according to the authors. They also annotated the perceived age of the subjects, of which 11% were children, 11% were teenagers, and 78% were adults. The 2019 version contains 23,571 images containing 34,320 annotated subjects, with the same gender division as the 2017 version, but containing 10% children, 7% teenagers, and 83% adults.
The Context-Aware Emotion Recognition (CAER) dataset [17] was proposed to address limitations perceived by the authors in other datasets, including EMOTIC. According to the authors, although EMOTIC contains contextual information, they work on a different aspect and propose a large-scale dataset for context-aware emotion recognition with varied context information. We propose an experiment to validate this claim later in this work. Another limitation, compared to EMOTIC, is that emotions are highly dynamic, and using images could be a limiting factor for techniques. Therefore, the authors propose the CAER dataset, which contains videos, and the CAER-S dataset, which contains static images and is the more commonly used benchmark of the two.
Another main differing factor between CAER and EMOTIC is the sample source. While EMOTIC focuses on reusing images from other popular datasets and complementing them with images from the web, CAER focuses on extracting images from television shows. This raises questions regarding the expressiveness of the emotions in the scenes. The psychology literature tackles the problem of posed emotions, and evidence suggests that they are at least an approximation of what is actually felt. Studies show that the impact of these problems can be diminished by using different actors who are unaware of the task—in this case, the actors were unaware that this footage would be used for the emotion recognition task. Therefore, we can consider that the negative impact is low [4,32,59].
The Body Language Dataset (BoLD) [55] has a different focus from EMOTIC and CAER. While also containing contextual information from background scenes, its focus is on the availability and description of body language, another significant nonverbal cue that the current literature has only recently begun to adopt [19,20]. The authors propose a dataset with crowdsourced emotional information from videos, validated using experiments with action recognition-based methods and Laban movement analysis features.
The construction process of the dataset is based on three stages: selection (of clips and time period), annotation of pose (pose estimation and tracking), and annotation of perceived emotional features. Given the intersection of these cues with activity recognition, the authors chose clips from the AVA dataset [60] for annotation. Each person receives a unique ID number, which is tracked across frames, and has their emotions annotated. To ensure the visibility of the body, close-up clips are removed.
The dataset contains 26,164 video clips with 48,037 instances, defined as identifiable characters with landmark tracking in a clip. Each instance is annotated by 5 participants, each of whom annotates 20 samples. This dataset is widely representative, with varied demographics, especially ethnicity categories that include underrepresented groups such as Hispanic or Latino and Native Hawaiian or Other Pacific Islander. Although the dataset is promising, it is still not widely used in the emotion recognition literature, even though it could lead to interesting in-the-wild applications.
The GroupWalk dataset [56] comprises videos recorded with stationary cameras in eight real-world scenarios. A total of 10 annotators annotated 3544 agents with visible faces across all videos. This dataset has yet to be widely used in the current literature, given its focus on a very specific nonverbal cue.
The Micro-Gesture Understanding and Emotion analysis (iMiGUE) dataset [54] focuses on nonverbal body gestures, specifically microgestures, to understand perceived emotion. The dataset contains samples extracted from post-match press conferences with professional athletes, which contain footage from interviews and questions after sports matches. In these cases, the player has little to no time to prepare for the questions. As a result, the expressions and reactions are usually not posed and have insightful reflections on the innate emotion.
The authors collected 359 videos with 258 wins and 101 losses from post-match grand slam tournament interviews, totaling 2092 min in duration. The dataset is identity-free, meaning that biometric data, such as the face and voice, have been masked. They also focus on ethnic diversity and gender balance. Five annotators were responsible for annotating the data, which contained information from body, head, hand, body+hand, and head+hand.

3.1. Techniques in the State of the Art

Recently, datasets recorded in the wild have gained attention for allowing evaluations in uncontrolled scenarios, in contrast to laboratory environments. The AfeW benchmark [51] was one of the first public databases used for this evaluation scenario and was employed as the main dataset for evaluation in the emotion recognition in the wild (EmotiW) [61] challenges. Subsequently, datasets such as EMOTIC [16,53] and CAER [17] have been published to deeply explore the participation of context in emotion recognition. They are the datasets most commonly used today by state-of-the-art techniques for evaluation.
In Table 3, we survey techniques from May 2019 to January 2023 and compare the participation of EMOTIC and CAER in the evaluation pipelines of these techniques. Although there are other promising datasets, especially BoLD [55], that could be used for emotion recognition, given that most of the techniques focus only on context, it is understandable that EMOTIC and CAER still dominate the evaluation focus in the state-of-the-art techniques. iMiGUE [54] and GroupWalk [56] are also not very common as benchmarks, mainly due to their specificity. AfeW and AfeW-VA are more common in approaches for FER; they are listed here because they also contain visible context information.

3.2. Emotion Categories and Annotations

Techniques related to affective computing usually focus on some variation of the well-known set of Ekman’s [72] basic emotions, which are Anger, Fear, Sadness, Enjoyment, Disgust, and Surprise. The same applies to datasets related to emotion recognition and its parent fields of study, which usually add the Neutral emotion.
This set of categorical annotations is the case for the CAER dataset [17], in which videos and images are annotated using Ekman’s basic emotions. A limitation, however, in the annotation format used in both CAER and CAER-S is that a single annotation is given for the whole video or image. This means that in images with multiple people, only one label is given, and no bounding box or identifiable information is provided to determine to which person that annotation refers. This harms techniques such as CAER-Net [17], GLAMOR-Net [18], and other techniques trained on this dataset, because the authors usually use the first face detected in the image, as discussed in their research papers. EmotiRAM [19] employs a face selector algorithm that searches for the leading performer in the scene, based on the assumption that the annotation refers to them, but it is still limited in images with difficult scenarios. We show examples of images with more than one person framed in Figure 2.
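For illustration, the snippet below sketches one such selection heuristic: detect all faces and keep the largest one as a proxy for the leading performer. The cascade-based detector and the largest-face rule are assumptions made for this example; they are not the exact selector used by EmotiRAM or the CAER baselines.

```python
# Illustrative heuristic for single-label images: detect faces and keep the
# largest one as a proxy for the leading performer. The Haar-cascade detector
# is only an example and not the detector used by the surveyed techniques.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def select_main_face(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face by area
    return image_bgr[y:y + h, x:x + w]
```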
For GroupWalk [56], a completely different set of emotions is proposed. The authors proposed using a new set of seven categories that take into consideration the level of the perceived emotion, which can be Happy, Sad, or Angry, combined with the modifiers Somewhat or Extremely, in addition to Neutral.
For EMOTIC, the authors proposed a different approach, defining an extended list of 26 emotional categories that also contains Ekman’s basic emotions, leading to 20 novel emotional states for comprehension; this list was also employed in BoLD. To define these emotional categories, the authors proposed an approach based on word connections (affiliations and relevance of words) and the interdependence of words (psychological and affective meaning) to form word groupings. However, the main difference is that neither EMOTIC nor BoLD contains the Neutral category, with the authors arguing that, generally, at least one category can be applicable, even with low intensity.
However, the lack of a neutral annotation group could lead to neutral images being sampled into opposite groups and therefore harm the learning process of a network. For example, research on social cognition points out that humans perceive emotionally neutral faces depending on visible traits; positive traits are correlated with happiness, and traits involving dominance and threat are correlated with anger [73,74]. Therefore, a person that appears to be emotionally stable would be associated with Happy, while a person that appears to be aggressive would be associated with Angry. In this scenario, given a neutral image, and more specifically one with a neutral face, the action of the annotator would likely depend on their perception of the personality traits of the people present in the image instead of the perceived emotions.

Assessing Annotators’ Agreement

Could a high number of classes impact the agreement between annotators? We hypothesize that the large number of possible classes could introduce uncertainty in the annotation, leading to disagreement among annotators. Therefore, as an example, we investigate the agreement among annotators on the EMOTIC dataset.
The authors also proposed a study to assess the level of agreement among annotators. They employed a quantitative metric known as Fleiss’s kappa [75], which evaluates the reliability of agreement among a fixed number of raters assigning categorical ratings. They showed that more than 50% of images had κ > 0.30, indicating that, in these cases, the annotations were better than random. As stated in Fleiss’s work, it is reasonable to interpret the absence of agreement among raters as their inability to distinguish subjects—in this case, different emotions. Following Fleiss’s work, the authors did not evaluate the agreement per category.
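As a reference for reproducing this kind of agreement analysis, the following minimal sketch computes Fleiss’s kappa with statsmodels, under the simplifying assumption that each rater assigns a single category index per (image, person) pair; the toy data are hypothetical.

```python
# Minimal sketch of a Fleiss's kappa computation using statsmodels, assuming
# each rater assigns one category index (0-25) per (image, person) pair; this
# simplifies the multi-label setting of EMOTIC.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def kappa(assignments, n_categories=26):
    """assignments: array of shape (subjects, raters) with category indices."""
    counts, _ = aggregate_raters(np.asarray(assignments), n_cat=n_categories)
    return fleiss_kappa(counts)  # counts: (subjects, categories) rater tallies

# Three raters labeling four (image, person) pairs:
print(kappa([[0, 0, 1],    # pair 1: two raters chose category 0, one chose 1
             [3, 3, 3],    # pair 2: perfect agreement
             [5, 5, 5],
             [7, 9, 7]]))
```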
We propose an experiment to provide an overview of the agreement among annotators. We empirically chose to focus on the test set, given that this is how current state-of-the-art approaches evaluate their results. For this experiment, given the set of test images I = {I_1, ..., I_n}, each containing a set of persons P = {P_1, ..., P_m}, we loop over each person P_i of each image I_i and store their annotations whenever that person has more than one annotation. We then count, in pairs, the co-occurrences of the emotions assigned to that sample and store this information in a co-occurrence matrix. For example, if the annotations for I_1, P_1 are Happy and Engagement from annotator 1 and Peace from annotator 2, we increment the co-occurrences of (Happy, Engagement), (Happy, Peace), (Engagement, Happy), (Engagement, Peace), (Peace, Happy), and (Peace, Engagement) by one. We thus construct a database containing the co-occurrences of the annotations given by all annotators for each image and person. We display a visualization of these data in Figure 3 as a confusion matrix, normalized by emotion to ease visualization. For a pair of emotions, the value shown indicates how often those two emotions co-occur: values close to one mean that annotators frequently assigned this pair of different emotions to the same person instead of agreeing on a single emotion, while values close to zero mean the pair rarely co-occurred. Given the dataset’s license, users must provide their own copy of EMOTIC’s annotations to reproduce this result.
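The counting procedure can be sketched as follows, assuming the EMOTIC test annotations have already been parsed into, for each (image, person) pair, a list of label sets (one set per annotator); this data layout and the row-wise normalization are assumptions for the sketch, not the exact code used for Figure 3.

```python
# Sketch of the co-occurrence count described above; the input layout
# (one list of annotator label-sets per person) is a hypothetical parsing
# of the EMOTIC annotation files.
from itertools import permutations

import numpy as np

def cooccurrence(per_person_annotations, categories):
    idx = {c: i for i, c in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for annotator_sets in per_person_annotations:
        if len(annotator_sets) < 2:           # keep persons with multiple annotators
            continue
        labels = [lab for s in annotator_sets for lab in s]
        for a, b in permutations(labels, 2):  # every ordered pair of entries
            counts[idx[a], idx[b]] += 1
    # One possible per-emotion normalization (row-wise by the maximum count);
    # the exact normalization used for Figure 3 may differ.
    row_max = counts.max(axis=1, keepdims=True)
    return np.divide(counts, row_max, out=np.zeros_like(counts), where=row_max > 0)

cats = ["Happy", "Engagement", "Peace"]
print(cooccurrence([[{"Happy", "Engagement"}, {"Peace"}]], cats))
```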
From this overview, we can see a general agreement between classes. For example, the co-occurrence between Affection and Happiness is 0.99 (1879 samples), indicating a strong agreement between annotations, as there is between Suffering and Sadness, with a 0.77 agreement (426 samples). We can also see, however, the odd distribution of Engagement, which, by the authors’ definition, is paying attention to something; absorbed into something; curious; interested [16]. Sensitivity, which is defined as the feeling of being physically or emotionally wounded; feeling delicate or vulnerable, shares co-occurrences with Happiness (0.42; 101 samples) and Sadness (0.59; 142 samples), even though these two have opposite definitions. Finally, we can also see outliers in these data, composed of directly opposite emotions: for example, Disconnection and Engagement (0.68; 1365 samples), Fear and Confidence (0.35; 104 samples), Fear and Excitement (0.44; 131 samples), and Pain and Happiness (0.23; 50 samples). A similar experiment was performed by the authors and published in their reference paper, but using a different approach from ours. In our approach, we look directly at the co-occurrence of each pair of emotions and compute the average based on the number of samples, which reveals insights that were not present in their evaluation. Scientists and engineers should use these insights when developing their applications, to be aware of the limitations present in the dataset.

3.3. Continuous Annotations

Some datasets also annotate samples with continuous dimensions instead of only discrete categories. For example, the VAD model [76] is usually employed for emotion recognition. In this model, emotions are placed in a three-dimensional space: valence, arousal, and dominance. The valence (V) axis determines whether an emotion is pleasant or unpleasant to the perceiver, therefore distinguishing between positive and negative emotions. The arousal (A) axis differentiates between active and passive emotions. Finally, the dominance (D) axis represents the control and dominance over the nature of the emotion. Therefore, each emotion can be represented as a linear combination of those three components [77].
The AfeW-VA dataset [52,78] contains annotations of valence and arousal for 600 clips, which were also, in part, used in AfeW, to create an expanded dataset. The values for each axis of the model range within [−10, 10]. However, as exposed in their work, a significant number of the annotations lie around the neutral value for valence and arousal, indicating that a significant part of the dataset contains neutral expressions. A possible reason for this behavior is that the entire dataset was annotated by just two individuals who, although certified in the facial action coding system, could be suffering from annotator burnout [79].
The EMOTIC dataset also contains annotations using the VAD model, ranging within [1, 10]. However, unlike AfeW-VA, the authors chose to rely on crowdsourced annotations powered by the Amazon Mechanical Turk (AMT) platform, with restrictions to discard annotations that were, in their opinion, not compatible with their discrete annotations; these control images would appear once for every 18 images shown to the annotator. However, as reported in their work, the metrics related to annotation consistency point to disagreement in some cases. For example, the standard deviation for the dominance dimension is 2.12, which can be considered high given that the values range within [1, 10]. For valence and arousal, the standard deviations are 1.41 and 0.70, respectively. Finally, the score distribution for each dimension is broader than AfeW-VA’s, indicating that the perceived emotions are more diverse from the annotators’ points of view. We extract the mean of the valence, arousal, and dominance axes for each annotated emotion on EMOTIC and show this visualization in Figure 4. The figure shows that, as expected, we have clusters of positive and negative emotions. Sadness and Suffering, for example, have almost the same mean VAD, while Pleasure and Happiness are also close. This visualization also points out annotations that become confused in this form of annotation, such as Peace and Fear.
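As a reference, the sketch below shows one way to compute these per-category VAD means; the record layout (a "categories" list and a "vad" triple per annotated person) is a hypothetical simplification of the EMOTIC annotation files.

```python
# Sketch of computing per-category mean VAD values, assuming annotations were
# loaded into per-person records with hypothetical fields "categories"
# (list of labels) and "vad" (valence, arousal, dominance).
from collections import defaultdict

import numpy as np

def mean_vad_per_category(annotations):
    sums = defaultdict(lambda: np.zeros(3))
    counts = defaultdict(int)
    for person in annotations:
        vad = np.asarray(person["vad"], dtype=float)
        for category in person["categories"]:
            sums[category] += vad
            counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

# Example with two hypothetical records:
records = [
    {"categories": ["Happiness", "Pleasure"], "vad": [8.0, 6.0, 7.0]},
    {"categories": ["Sadness"], "vad": [2.5, 3.0, 3.5]},
]
print(mean_vad_per_category(records))
```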
Finally, there is no consensus in the community that the VAD model is a good approach to representing emotions. It is challenging to represent emotional categories with numbers (Figure 4), given that this differs from how humans naturally perceive emotions, even more so when the emotion needs to be divided into different categories. The representation is also deeply personal and changes from person to person and culture to culture. In datasets with both continuous and categorical annotations, such as EMOTIC, one might employ weights for each annotation format to control their participation in the loss calculation, for example.
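A minimal sketch of such a weighted loss, written in PyTorch and assuming a model with a 26-logit categorical head and a 3-value VAD head, is given below; the weights and dummy tensors are illustrative assumptions rather than values used by any surveyed technique.

```python
# Minimal PyTorch sketch of weighting categorical and continuous (VAD) loss
# terms when a dataset, such as EMOTIC, provides both annotation formats.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # multi-label categorical head (26 logits)
mse = nn.MSELoss()             # continuous head (3 VAD values)
w_cat, w_cont = 1.0, 0.5       # example weights controlling each term's participation

def combined_loss(cat_logits, cat_targets, vad_pred, vad_targets):
    return w_cat * bce(cat_logits, cat_targets) + w_cont * mse(vad_pred, vad_targets)

# Dummy batch of 4 samples with 26 categories and VAD values in [1, 10]:
loss = combined_loss(torch.randn(4, 26), torch.randint(0, 2, (4, 26)).float(),
                     1 + 9 * torch.rand(4, 3), 1 + 9 * torch.rand(4, 3))
print(loss.item())
```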

3.4. Presence of Nonverbal Cues

Except for the AfeW and AfeW-VA datasets, all the other databases surveyed in this work were designed around nonverbal cues other than facial expressions. For example, EMOTIC and CAER/CAER-S focus on context; BoLD focuses on body language; GroupWalk focuses on gaits; and iMiGUE focuses on micro-gestures of the body, head, and hands, and combinations of them. However, it is possible to extract other nonverbal cues from these datasets, even if these were not the focus of their design. We list the presence and visibility of nonverbal cues in Table 4.
Given how facial expressions are highly significant for emotion recognition, it is important to have good crops of the face available, which is the case for most of the samples from these datasets. For EMOTIC, however, it is more common to see images with severe facial occlusion, as shown in Figure 5. For the other datasets, since the data were obtained from television clips or movies, it is more common for people to be aligned with the camera. For GroupWalk, given that security cameras and handheld devices were also used to record the clips, the faces are sometimes distant from the camera, making it difficult to explore this cue.

3.4.1. Context Variability

Context information is also available in most samples of the datasets selected for evaluation in this survey. Even for datasets such as AfeW and AfeW-VA, which are commonly used for FER, background and scene information is available and can be used to leverage context. However, when proposing the CAER dataset, the authors argued that the datasets available up to the date of publication (including EMOTIC) did not contain varied context information. When exploring CAER-S, we note that, even though the source of the samples is 79 television shows, a significant number of frames contain repeated background information, which does not occur to the same degree in EMOTIC due to the source of its data.
To validate the variability in background information (context) in EMOTIC and CAER-S, we propose an experiment using open-source code for scene recognition. For this, we use the work by [80], which extracts features from images for classification. Given an image I, we feed it through the pipeline of the proposed network to extract the scene label s, which describes at some level the context of the input image. Please note that, although this technique is competitive with the state of the art in terms of Top@1 score, our intent is not to quantify the context information by counting how many images have a specific background, but rather to understand whether the images are different enough to receive different classifications within each dataset.
First, we modified the data loaders proposed by the authors to load images from the CAER-S and EMOTIC datasets separately. After this, we employed the same transforms used in their evaluation to extract comparable information. We decided empirically to use the model pre-trained on the Places365 dataset [81], given its high data variability. We then sampled each image and extracted its Top@1 classification, storing it in a file for further processing. We show the results of this experiment in Figure 6. This graph shows that the classifications given by [80] are more grouped in CAER-S than in EMOTIC, implying that the images in CAER-S are less diverse. Please note that, although the datasets contain a different number of samples, our investigation concerns the data distribution: a dataset with diverse backgrounds would be more spread out in this graph, as happens with EMOTIC, independent of the number of samples it contains. Finally, this experiment complements individual observations on the datasets, indicating that CAER and, subsequently, CAER-S do not have higher context variability than EMOTIC.
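The procedure can be reproduced with any Places365-pretrained classifier; the sketch below counts Top@1 scene labels over an image folder. Model loading, the category list, and the directory layout are placeholders and do not reproduce the exact pipeline of [80].

```python
# Sketch of the context-variability check: classify every image with a scene
# recognition model pre-trained on Places365 and compare the Top@1 label
# histograms of CAER-S and EMOTIC. Model loading is left as a placeholder.
from collections import Counter
from pathlib import Path

import torch
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def top1_histogram(model, classes, image_dir, device="cpu"):
    """Count Top@1 scene labels over all .jpg images under image_dir."""
    hist = Counter()
    model.eval().to(device)
    for path in Path(image_dir).glob("**/*.jpg"):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        with torch.no_grad():
            hist[classes[model(x).argmax(dim=1).item()]] += 1
    return hist   # a flatter histogram suggests more varied context

# Usage (placeholders): model, classes = load_places365_model(), load_categories()
# print(top1_histogram(model, classes, "CAER-S/test").most_common(10))
```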

3.4.2. Body Keypoints Visibility

Another essential cue that is present in these datasets is the body. Even though only specific datasets such as iMiGUE have annotated body language (see Table 3), scientists and engineers can extract insights from this cue by using different types of approaches. Ref. [19], for example, employed a body encoding stream that used a simple model proposed by [82] until its last convolutional layer, allowing the network to correlate the internal representations of this model and emotion. After that, more robust approaches were proposed, which focused on an activity recognition pipeline, such as [20,56]. We propose an experiment to assess the visibility of the body keypoints on images from three datasets: EMOTIC and CAER, which are the main datasets used today in the literature, and also BoLD, given its body language focus.
In this experiment, we loop through every image of the datasets, except for BoLD, for which we choose a single random frame from every video. This different approach is motivated by two factors: (a) there is low variability between the frames of each BoLD video, and (b) the high number of recorded frames would make this experiment unfeasible due to the high processing time. Please note that, although BoLD has ground-truth body keypoint annotations, we chose not to use them in our evaluation to keep a comparable basis regarding body keypoint visibility.
We use you only look once (YOLO) [83] on each image or video frame to detect the people present in the scene. EMOTIC has ground-truth annotations regarding this aspect; again, we chose not to use them in our evaluation to keep a comparable basis among datasets. For each detected person, we feed the cropped region to MediaPipe [84] to extract the detected pose. We empirically chose MediaPipe for this experiment due to its capability of also predicting keypoint visibility. Next, we sum the visibility of each of the 33 keypoints for each dataset. Finally, we plot these points using a spatial distribution representing a neutral body pose. For the drawing, we normalized the keypoint visibility with respect to the maximum stored value to allow a better comparison between points in the same and different datasets. We show the results of this experiment in Figure 7.
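The core of this measurement can be sketched as follows, assuming the ultralytics YOLO package and MediaPipe Pose; the detector weights, dataset iteration, and paths are illustrative assumptions rather than the exact setup used here.

```python
# Sketch of the keypoint-visibility measurement: detect people with YOLO,
# estimate each person's pose with MediaPipe, and accumulate per-landmark
# visibility. Weights and image paths are placeholders.
import cv2
import mediapipe as mp
import numpy as np
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                      # person detector (example weights)
pose = mp.solutions.pose.Pose(static_image_mode=True)

def keypoint_visibility(image_paths):
    visibility = np.zeros(33)                      # MediaPipe Pose has 33 landmarks
    for path in image_paths:
        image = cv2.imread(str(path))
        if image is None:
            continue
        for box in detector(image, classes=[0], verbose=False)[0].boxes.xyxy:
            x1, y1, x2, y2 = map(int, box.tolist())
            if x2 <= x1 or y2 <= y1:
                continue
            crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
            result = pose.process(crop)
            if result.pose_landmarks:
                for i, lm in enumerate(result.pose_landmarks.landmark):
                    visibility[i] += lm.visibility
    return visibility / max(visibility.max(), 1e-8)   # normalize by the maximum value
```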
From the results of this experiment, we can notice that CAER-S is the dataset with the least visibility of the lower body parts. Given how the dataset is built upon video clips from television shows, it is common to have the lead performer centered on the screen and the lower body shown only when relevant to the story. BoLD, although focused on body language, also has less visibility of the lower body parts when compared to EMOTIC. Please note that BoLD provides body annotations and that this result may differ when these annotations are used. Finally, EMOTIC has a well-distributed visibility of the whole body, which is justified by the multiple camera placements present in the dataset. All of the datasets have good visibility for the face, hands, and arms.

4. Discussion

Scientists and engineers currently use all of the datasets listed in this work to develop emotion recognition models and pipelines. However, in which scenarios could techniques trained using these datasets be deployed? Models, no matter how deep or robust, are still limited by the quality and availability of data. Therefore, based on features extracted from these datasets, we propose a discussion regarding application scenarios and gaps that we could expect in these approaches. Finally, before working on the construction and deployment of the technology, one should compare the requirements with the details described in each dataset above. Researchers should also know that these datasets were not built toward a specific requirement. The best practice is to have a dataset to fine-tune techniques for each specific need. We discuss below the possible application scenarios and limitations of each dataset, while also summarizing this discussion in Table 5.

4.1. Application Scenarios

Smart environments and spaces. A trending topic in the literature is extracting semantics related to people to improve decision making and extract insights from public and private spaces. Urbanists have been working on managing public spaces and proposing interventions based on how the local community relates to each place. Usually, this is performed in person by observing the region and registering its usage by residents. Researchers have proposed the use of pedestrian tracking [85] to understand this relation and plan interventions. This scenario could be improved by understanding emotion. Other research has also been proposed with the objective of correlating emotions and environments [9,10,12]. In these scenarios, surveillance cameras that are already installed could be reused to lower costs, for example. These cameras are usually placed with a top-down view, in which one can see the entire person, but only sometimes can their face be seen with clarity. In these scenarios, one should not depend too much on facial expression recognition but focus on other cues to extract emotion. Context is also not very informative since, in these views, the context is not prone to change and would usually consist only of the floor in the scene. GroupWalk and BoLD could be suggested for these application scenarios. Depending on the needed emotional representation, GroupWalk could be used to train or evaluate the model or pipeline, bearing in mind that the image quality can vary between recordings. The body language annotations in BoLD could be useful when a more detailed emotional representation is needed, for example, to understand how groups of people are behaving in a given region of the environment.
Another possibility is related to retail or private businesses. Decision makers could place cameras with a more frontal view of people and explore the facial cues, for example. Cameras placed at the cashier and at the store exit could extract information about the user’s experience in the store. For this case, AfeW, CAER, and CAER-S could be employed, given that they always provide a good facial crop of the performers, that the body is only sometimes evident, and that the context does not vary much. In cases where context variability is expected, EMOTIC could be helpful, given the different types of semantics that can be extracted from it. Furthermore, training on this dataset could help towards a robust model in cases where faces will not be visible.
Affective and assistive robotics. Another topic that has gained interest in the literature is assistive robotics. Currently, we have tasks from competitions in which the robot should follow someone around and help them with some tasks. However, these could also be extended into a more affective sense, allowing the development of affective robots to help people who need special care.
In [86], the authors proposed CuDDLer, a social robot designed as a companion for children. The robot can extract a child’s emotions and react accordingly. For example, if the child is sad or even crying, the robot could respond with emotional reactions to help build positive feelings. In this case, affective robots could be powered by multi-cue emotion recognition models that would help them respond better by understanding the context around them. Models for this task could be trained using EMOTIC to allow context encoding and semantics representation [66]. Of course, one cannot expect a child or a person that must be accompanied to look at the robot all of the time, so techniques must be robust to facial occlusion.
Emotion tracking and mental health. Systems that track mood and emotion are more common today, mainly due to the recent discussions regarding the importance of mental health and the increase in mental health conditions following the COVID-19 outbreak, whose outcomes could last for years. Although clinicians can give mental support through consultations, patients can sometimes feel uncomfortable exposing their feelings verbally [87]. Telemedicine systems that monitor a patient’s well-being could be expanded to also monitor their mood and emotions over time. EMOTIC could also be employed here, given its high context availability and people who are not always facing the camera, to track mood from gallery photos and cameras placed in the person’s house. For example, techniques could benefit from a high-level context description to correlate household objects with emotions.

4.2. Limitations

Specific tasks. Two of the datasets surveyed in this work (iMiGUE and GroupWalk) were proposed to solve specific tasks. The way iMiGUE data were sampled places it close to the optimal scenario of extracting emotion from interviews. However, the simplified category representation used by the authors limits techniques aiming at the overall perception of emotion, narrowing them down to only “positive” or “negative” representations. The identity-free aspect of the dataset also negatively impacts how techniques could learn to represent other emotion categories since, for example, the facial expression is hidden for protection. Therefore, the dataset limits techniques to a specific task, which cannot be expanded to similar applications, such as assessing comfort in job interviews or evaluating the quality of customer care services, due to these imposed restrictions.
Other limitations are present in the GroupWalk dataset, related to sample quality and image resolution. Although the variation in camera placements is a feature that would add robustness to techniques, in some cases, the camera appears to be handheld. We can notice distortions in the videos, probably caused by stabilization software used to remove motion blur. Besides the noise and low image quality in some samples, bounding boxes are also drawn directly onto the images. Researchers and engineers have been working with data augmentation and have shown that employing techniques such as random crops, rotations, and noise can improve the overall performance of a model. However, when these artifacts are written directly into the image, the accuracy of the models could be harmed.
We still lack datasets for other specific tasks in the literature. For example, for in-the-wild scenarios, a viewpoint from a security camera would be ideal since it would allow the reuse of this footage for emotion recognition. A possibility would be to augment the data to be close to this scenario; however, this is difficult when using 2D data such as images and 2D body landmarks.
Labels. Another limitation in this field of study is related to the labeling of datasets. From this survey, we show that there is at least one significant problem related to the labeling of data that could be harmful to techniques. EMOTIC and BoLD, the datasets that employ the 26-category annotation format, do not add a neutrality class but rather rely on categories such as Disconnection or Peace that could also be related to Neutral. However, two problems arise with this approach: first, as we discussed before, visible traits of the person sampled could lead to a correlation with both positive and negative emotions. Second, the high number of possible categories has a direct impact on annotators’ agreement.
On the other hand, the CAER and CAER-S datasets employ an annotation format of seven categories, one related to neutrality. However, their annotations are not as structured as in EMOTIC, in which one can access individual emotions and demographics for each person in the image. In multiple cases, more than one person is present in the scene, and it is difficult to determine to whom the annotation refers.
Lack of consistency in labeling. We also show that there is little concordance among datasets related to which classes should be employed, as we show in Table 6. This is a problem because it limits the possible datasets that one may use regarding a specific application scenario. For example, for an application more closely related to iMiGUE or CAER-S, one may not be able to extract representations related to Suffering, which is present in EMOTIC or BoLD.
One may argue that this could be addressed through transfer learning, for example, by fine-tuning a model to produce predictions for one dataset after learning representations from another. Even when there is no domain mismatch in this scenario, the data augmentations performed during training could still be harmful to performance. As [88] discusses, several pre-training factors impact the accuracy of a model, such as data augmentation and the amount of data used in pre-training. Fine-tuning can also lead to overfitting [89] and catastrophic forgetting [90,91], as well as other problems related to training procedures, such as the investment of resources. Finally, the argument that one may pre-train on dataset X to learn its representations and then fine-tune on dataset Y to obtain that set of categories is not supported by the current literature, and more experiments are needed to validate this hypothesis.
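For reference, a minimal fine-tuning sketch in PyTorch is shown below, assuming a ResNet-50 backbone pre-trained on a 26-category source label set and adapted to a 7-category target set; it is a sketch of the general procedure discussed above, not of any specific surveyed technique, and freezing the backbone only mitigates rather than removes the risks of overfitting and forgetting.

```python
# Sketch of head replacement for fine-tuning across label sets (assumptions:
# PyTorch, ResNet-50 backbone, 26 source and 7 target categories).
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 26)     # head for the source label set
# ... load weights pre-trained on the source dataset here ...

for param in model.parameters():                   # freeze the backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 7)      # fresh head for the target label set
# During fine-tuning, only model.fc.parameters() would be passed to the optimizer.
```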
Lack of ontologies to represent emotion. An ontology defines a set of primitives related to a specific domain of knowledge. Although ontologies related to emotion exist, such as MFOEM [92], they are not currently used as a baseline to define emotion categories and what they represent. For example, Surprise, which MFOEM defines as an emotion caused by encountering unexpected events, can carry a positive or a negative modifier depending on the person’s evaluation of the event, a distinction that is absent from current datasets. EMOTIC and BoLD include closely related emotions, such as Happiness and Pleasure; MFOEM suggests that Pleasure can also be correlated with satisfaction, which could help differentiate these two classes more precisely. Therefore, by employing ontologies for emotion representation, one could merge the wide range of label sets in use and ground the description of each category in the literature.
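A hedged sketch of such an ontology-grounded mapping is given below; the concept identifiers are illustrative placeholders rather than actual MFOEM terms, and the point is only to show how heterogeneous label sets could be projected onto a shared vocabulary.

```python
# Sketch: mapping dataset-specific labels to shared (placeholder) ontology concepts.
ONTOLOGY_MAP = {
    "EMOTIC": {
        "Happiness": "emotion:joy",
        "Pleasure":  "emotion:pleasure",   # related to satisfaction in MFOEM
        "Surprise":  "emotion:surprise",   # valence modifier left unspecified
    },
    "CAER": {
        "Happy":    "emotion:joy",
        "Surprise": "emotion:surprise",
    },
}

def to_concept(dataset: str, label: str) -> str:
    """Project a dataset-specific label onto the shared concept vocabulary."""
    return ONTOLOGY_MAP.get(dataset, {}).get(label, "emotion:unmapped")

print(to_concept("CAER", "Happy"))   # -> emotion:joy
```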

5. Conclusions

In this work, we present an evaluation of datasets for vision-based emotion recognition, focusing on datasets that provide more than one visible cue simultaneously. We propose experiments to explore the data and correlate them with findings from the psychology literature.
From these experiments, among other insights and highlights, we show that the two datasets most widely used by state-of-the-art techniques differ significantly, owing to the distinct features they provide. The CAER and CAER-S datasets offer a more stable benchmark, with faces centered relative to the camera, whereas EMOTIC provides a benchmark that is more unpredictable. Both datasets employ annotation formats that can be limiting to some degree, depending on the planned application scenario. EMOTIC uses 26 categories to represent emotion, none of which corresponds to neutrality; this high number of possible categories could reduce annotator agreement and harm some applications. CAER, in turn, uses a reduced set of categories, comprising Ekman’s six basic emotions and a neutrality label, but lacks information on which person in the scene each annotation refers to.
We also evaluate datasets that are less common in the literature, such as BoLD, iMiGUE, and GroupWalk, which focus on specific cues and may align well with particular application scenarios. Finally, we discuss these scenarios and suggest approaches based on findings from these datasets to help guide future research and deployment.
With this work, we expect to provide a roadmap for upcoming research and experimentation in emotion recognition under real-world conditions. By presenting our proposed investigations and experiments, we hope to assist decision makers in selecting the most suitable dataset for training and validation. Furthermore, deep learning often demands a significant amount of energy, which can have adverse environmental effects. We therefore anticipate that this guidance will enable informed decisions regarding data quality and availability, reducing the need to train or evaluate on multiple datasets in order to validate a concept.

Author Contributions

Conceptualization, W.C. and E.T.; methodology, W.C. and E.T.; software, W.C. and R.O.; validation, W.C., J.M.T. and J.P.L.; formal analysis, W.C.; investigation, W.C. and R.O.; resources, L.F. and V.T.; data curation, W.C. and R.O.; writing—original draft preparation, W.C.; writing—review and editing, E.T., J.M.T. and J.P.L.; visualization, W.C. and E.T.; supervision, E.T., L.F. and V.T.; project administration, W.C., L.F. and V.T.; funding acquisition, V.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001 and by Conselho Nacional de Desenvolvimento Científico e Tecnológico (process 422728/2021-7).

Data Availability Statement

The source code for the experiments is available at: https://github.com/wlcosta/2023-a-survey-applsci (accessed on 4 April 2023). The user will need to provide the paths to the datasets in their own environment to be able to reproduce the experiments, due to license limitations from the datasets. We list the homepage of each dataset below. Please note that the authors of this work are not associated with these research groups and, therefore, cannot respond to any questions regarding data availability. Used datasets: EMOTIC (https://s3.sunai.uoc.edu/emotic/index.html, accessed on 4 April 2023), CAER and CAER-S (https://caer-dataset.github.io/, accessed on 4 April 2023), iMiGUE (https://github.com/linuxsino/iMiGUE, accessed on 4 April 2023), AfeW (https://cs.anu.edu.au/few/AFEW.html, accessed on 4 April 2023), AfeW-VA (https://ibug.doc.ic.ac.uk/resources/afew-va-database/, accessed on 4 April 2023), BoLD (https://cydar.ist.psu.edu/emotionchallenge/index.php, accessed on 4 April 2023), GroupWalk (https://gamma.umd.edu/software/, accessed on 4 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sternglanz, R.W.; DePaulo, B.M. Reading nonverbal cues to emotions: The advantages and liabilities of relationship closeness. J. Nonverbal Behav. 2004, 28, 245–266. [Google Scholar] [CrossRef]
  2. Rouast, P.V.; Adam, M.T.; Chiong, R. Deep learning for human affect recognition: Insights and new developments. IEEE Trans. Affect. Comput. 2019, 12, 524–543. [Google Scholar] [CrossRef]
  3. Patel, D.S. Body Language: An Effective Communication Tool. IUP J. Engl. Stud. 2014, 9, 7. [Google Scholar]
  4. Wallbott, H.G.; Scherer, K.R. Cues and channels in emotion recognition. J. Personal. Soc. Psychol. 1986, 51, 690. [Google Scholar] [CrossRef]
  5. Archer, D.; Akert, R.M. Words and everything else: Verbal and nonverbal cues in social interpretation. J. Personal. Soc. Psychol. 1977, 35, 443. [Google Scholar] [CrossRef]
  6. Barrett, L.F.; Kensinger, E.A. Context is routinely encoded during emotion perception. Psychol. Sci. 2010, 21, 595–599. [Google Scholar] [CrossRef] [PubMed]
  7. Barrett, L.F.; Mesquita, B.; Gendron, M. Context in emotion perception. Curr. Dir. Psychol. Sci. 2011, 20, 286–290. [Google Scholar] [CrossRef]
  8. Guthier, B.; Alharthi, R.; Abaalkhail, R.; El Saddik, A. Detection and visualization of emotions in an affect-aware city. In Proceedings of the 1st International Workshop on Emerging Multimedia Applications and Services for Smart Cities, Orlando, FL, USA, 7 November 2014; pp. 23–28. [Google Scholar]
  9. Aerts, R.; Honnay, O.; Van Nieuwenhuyse, A. Biodiversity and human health: Mechanisms and evidence of the positive health effects of diversity in nature and green spaces. Br. Med. Bull. 2018, 127, 5–22. [Google Scholar] [CrossRef]
  10. Wei, H.; Hauer, R.J.; Chen, X.; He, X. Facial expressions of visitors in forests along the urbanization gradient: What can we learn from selfies on social networking services? Forests 2019, 10, 1049. [Google Scholar] [CrossRef]
  11. Wei, H.; Hauer, R.J.; Zhai, X. The relationship between the facial expression of people in university campus and host-city variables. Appl. Sci. 2020, 10, 1474. [Google Scholar] [CrossRef]
  12. Meng, Q.; Hu, X.; Kang, J.; Wu, Y. On the effectiveness of facial expression recognition for evaluation of urban sound perception. Sci. Total Environ. 2020, 710, 135484. [Google Scholar] [CrossRef] [PubMed]
  13. Dhall, A.; Goecke, R.; Joshi, J.; Hoey, J.; Gedeon, T. Emotiw 2016: Video and group-level emotion recognition challenges. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 427–432. [Google Scholar]
  14. Su, W.; Zhang, H.; Su, Y.; Yu, J. Facial Expression Recognition with Confidence Guided Refined Horizontal Pyramid Network. IEEE Access 2021, 9, 50321–50331. [Google Scholar] [CrossRef]
  15. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6897–6906. [Google Scholar]
  16. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context based emotion recognition using emotic dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2755–2766. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10143–10152. [Google Scholar]
  18. Le, N.; Nguyen, K.; Nguyen, A.; Le, B. Global-local attention for emotion recognition. Neural Comput. Appl. 2022, 34, 21625–21639. [Google Scholar] [CrossRef]
  19. Costa, W.; Macêdo, D.; Zanchettin, C.; Talavera, E.; Figueiredo, L.S.; Teichrieb, V. A Fast Multiple Cue Fusing Approach for Human Emotion Recognition. SSRN Preprint 4255748. 2022. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4255748 (accessed on 5 April 2023).
  20. Chen, J.; Yang, T.; Huang, Z.; Wang, K.; Liu, M.; Lyu, C. Incorporating structured emotion commonsense knowledge and interpersonal relation into context-aware emotion recognition. Appl. Intell. 2022, 53, 4201–4217. [Google Scholar] [CrossRef]
  21. Saxena, A.; Khanna, A.; Gupta, D. Emotion recognition and detection methods: A comprehensive survey. J. Artif. Intell. Syst. 2020, 2, 53–79. [Google Scholar] [CrossRef]
  22. Zepf, S.; Hernandez, J.; Schmitt, A.; Minker, W.; Picard, R.W. Driver emotion recognition for intelligent vehicles: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 1–30. [Google Scholar] [CrossRef]
  23. Canal, F.Z.; Müller, T.R.; Matias, J.C.; Scotton, G.G.; de Sa Junior, A.R.; Pozzebon, E.; Sobieranski, A.C. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf. Sci. 2022, 582, 593–617. [Google Scholar] [CrossRef]
  24. Veltmeijer, E.A.; Gerritsen, C.; Hindriks, K. Automatic emotion recognition for groups: A review. IEEE Trans. Affect. Comput. 2021, 14, 89–107. [Google Scholar] [CrossRef]
  25. Khan, M.A.R.; Rostov, M.; Rahman, J.S.; Ahmed, K.A.; Hossain, M.Z. Assessing the Applicability of Machine Learning Models for Robotic Emotion Monitoring: A Survey. Appl. Sci. 2023, 13, 387. [Google Scholar] [CrossRef]
  26. Thanapattheerakul, T.; Mao, K.; Amoranto, J.; Chan, J.H. Emotion in a century: A review of emotion recognition. In Proceedings of the 10th International Conference on Advances in Information Technology, Bangkok, Thailand, 10–13 December 2018; pp. 1–8. [Google Scholar]
  27. Birdwhistell, R.L. Introduction to Kinesics: An Annotation System for Analysis of Body Motion and Gesture; Department of State, Foreign Service Institute: Arlington, VA, USA, 1952. [Google Scholar]
  28. Frank, L.K. Tactile Communication. ETC Rev. Gen. Semant. 1958, 16, 31–79. [Google Scholar]
  29. Hall, E.T. A System for the Notation of Proxemic Behavior. Am. Anthropol. 1963, 65, 1003–1026. [Google Scholar] [CrossRef]
  30. Darwin, C. The Expression of the Emotions in Man and Animals; John Murray: London, UK, 1872. [Google Scholar]
  31. Ekman, P. Facial expression and emotion. Am. Psychol. 1993, 48, 384. [Google Scholar] [CrossRef] [PubMed]
  32. Wallbott, H.G. Bodily expression of emotion. Eur. J. Soc. Psychol. 1998, 28, 879–896. [Google Scholar] [CrossRef]
  33. Tracy, J.L.; Matsumoto, D. The spontaneous expression of pride and shame: Evidence for biologically innate nonverbal displays. Proc. Natl. Acad. Sci. USA 2008, 105, 11655–11660. [Google Scholar] [CrossRef]
  34. Keltner, D. Signs of appeasement: Evidence for the distinct displays of embarrassment, amusement, and shame. J. Personal. Soc. Psychol. 1995, 68, 441. [Google Scholar] [CrossRef]
  35. Tassinary, L.G.; Cacioppo, J.T. Unobservable facial actions and emotion. Psychol. Sci. 1992, 3, 28–33. [Google Scholar] [CrossRef]
  36. Ekman, P.; Roper, G.; Hager, J.C. Deliberate facial movement. Child Dev. 1980, 51, 886–891. [Google Scholar] [CrossRef]
  37. Ekman, P.; O’Sullivan, M.; Friesen, W.V.; Scherer, K.R. Invited article: Face, voice, and body in detecting deceit. J. Nonverbal Behav. 1991, 15, 125–135. [Google Scholar] [CrossRef]
  38. Greenaway, K.H.; Kalokerinos, E.K.; Williams, L.A. Context is everything (in emotion research). Soc. Personal. Psychol. Compass 2018, 12, e12393. [Google Scholar] [CrossRef]
  39. Van Kleef, G.A.; Fischer, A.H. Emotional collectives: How groups shape emotions and emotions shape groups. Cogn. Emot. 2016, 30, 3–19. [Google Scholar] [CrossRef] [PubMed]
  40. Aviezer, H.; Hassin, R.R.; Ryan, J.; Grady, C.; Susskind, J.; Anderson, A.; Moscovitch, M.; Bentin, S. Angry, disgusted, or afraid? Studies on the malleability of emotion perception. Psychol. Sci. 2008, 19, 724–732. [Google Scholar] [CrossRef] [PubMed]
  41. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
  42. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the Neural Information Processing: 20th International Conference, ICONIP 2013, Proceedings, Part III 20, Daegu, Republic of Korea, 3–7 November 2013; Springer: Cham, Switzerland, 2013; pp. 117–124. [Google Scholar]
  43. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  44. Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
  45. Kollias, D.; Zafeiriou, S. Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv 2019, arXiv:1910.04855. [Google Scholar]
  46. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. arXiv 2021, arXiv:2109.07270. [Google Scholar]
  47. Antoniadis, P.; Filntisis, P.P.; Maragos, P. Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021. [Google Scholar] [CrossRef]
  48. Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study. Neurocomputing 2022, 514, 435–450. [Google Scholar] [CrossRef]
  49. Aouayeb, M.; Hamidouche, W.; Soladie, C.; Kpalma, K.; Seguier, R. Learning vision transformer with squeeze and excitation for facial expression recognition. arXiv 2021, arXiv:2107.03107. [Google Scholar]
  50. Meng, D.; Peng, X.; Wang, K.; Qiao, Y. Frame attention networks for facial expression recognition in videos. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3866–3870. [Google Scholar]
  51. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Acted Facial Expressions in the Wild Database; Technical Report TR-CS-11; Australian National University: Canberra, Australia, 2011; Volume 2, p. 1. [Google Scholar]
  52. Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
  53. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. EMOTIC: Emotions in Context dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 61–69. [Google Scholar]
  54. Liu, X.; Shi, H.; Chen, H.; Yu, Z.; Li, X.; Zhao, G. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10631–10642. [Google Scholar]
  55. Luo, Y.; Ye, J.; Adams, R.B.; Li, J.; Newman, M.G.; Wang, J.Z. ARBEE: Towards automated recognition of bodily expression of emotion in the wild. Int. J. Comput. Vis. 2020, 128, 1–25. [Google Scholar] [CrossRef]
  56. Mittal, T.; Guhan, P.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. Emoticon: Context-aware multimodal emotion recognition using frege’s principle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14234–14243. [Google Scholar]
  57. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  58. Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vis. 2019, 127, 302–321. [Google Scholar] [CrossRef]
  59. Zuckerman, M.; Hall, J.A.; DeFrank, R.S.; Rosenthal, R. Encoding and decoding of spontaneous and posed facial expressions. J. Personal. Soc. Psychol. 1976, 34, 966. [Google Scholar] [CrossRef]
  60. Gu, C.; Sun, C.; Ross, D.A.; Vondrick, C.; Pantofaru, C.; Li, Y.; Vijayanarasimhan, S.; Toderici, G.; Ricco, S.; Sukthankar, R.; et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6047–6056. [Google Scholar]
  61. Dhall, A.; Goecke, R.; Joshi, J.; Wagner, M.; Gedeon, T. Emotion recognition in the wild challenge 2013. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, Sydney, Australia, 9–13 December 2013; pp. 509–516. [Google Scholar]
  62. Wu, J.; Zhang, Y.; Ning, L. The Fusion Knowledge of Face, Body and Context for Emotion Recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; pp. 108–113. [Google Scholar]
  63. Zhang, M.; Liang, Y.; Ma, H. Context-aware affective graph reasoning for emotion recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 151–156. [Google Scholar]
  64. Thuseethan, S.; Rajasegarar, S.; Yearwood, J. Boosting emotion recognition in context using non-target subject information. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–7. [Google Scholar]
  65. Peng, K.; Roitberg, A.; Schneider, D.; Koulakis, M.; Yang, K.; Stiefelhagen, R. Affect-DML: Context-Aware One-Shot Recognition of Human Affect using Deep Metric Learning. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; pp. 1–8. [Google Scholar]
  66. Wu, S.; Zhou, L.; Hu, Z.; Liu, J. Hierarchical Context-Based Emotion Recognition with Scene Graphs. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15. [Google Scholar] [CrossRef] [PubMed]
  67. Yang, D.; Huang, S.; Wang, S.; Liu, Y.; Zhai, P.; Su, L.; Li, M.; Zhang, L. Emotion Recognition for Multiple Context Awareness. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–24 October 2022; Springer: Cham, Switzerland, 2022; pp. 144–162. [Google Scholar]
  68. Gao, Q.; Zeng, H.; Li, G.; Tong, T. Graph reasoning-based emotion recognition network. IEEE Access 2021, 9, 6488–6497. [Google Scholar] [CrossRef]
  69. Zhao, Z.; Liu, Q.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. AAAI Conf. Artif. Intell. 2021, 35, 3510–3519. [Google Scholar] [CrossRef]
  70. Zhao, Z.; Liu, Q.; Wang, S. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Trans. Image Process. 2021, 30, 6544–6556. [Google Scholar] [CrossRef]
  71. Zhou, S.; Wu, X.; Jiang, F.; Huang, Q.; Huang, C. Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks. Int. J. Environ. Res. Public Health 2023, 20, 1400. [Google Scholar] [CrossRef]
  72. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  73. Said, C.P.; Sebe, N.; Todorov, A. Structural resemblance to emotional expressions predicts evaluation of emotionally neutral faces. Emotion 2009, 9, 260. [Google Scholar] [CrossRef]
  74. Montepare, J.M.; Dobish, H. The contribution of emotion perceptions and their overgeneralizations to trait impressions. J. Nonverbal Behav. 2003, 27, 237–254. [Google Scholar] [CrossRef]
  75. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378. [Google Scholar] [CrossRef]
  76. Mehrabian, A. Basic Dimensions for a General Psychological Theory: Implications for Personality, Social, Environmental, and Developmental Studies; Oelgeschlager, Gunn & Hain: Cambridge, MA, USA, 1980. [Google Scholar]
  77. Kołakowska, A.; Szwoch, W.; Szwoch, M. A review of emotion recognition methods based on data acquired via smartphone sensors. Sensors 2020, 20, 6367. [Google Scholar] [CrossRef] [PubMed]
  78. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimed. 2012, 19, 34–41. [Google Scholar] [CrossRef]
  79. Pandey, R.; Purohit, H.; Castillo, C.; Shalin, V.L. Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning. Int. J. Hum. Comput. Stud. 2022, 160, 102772. [Google Scholar] [CrossRef]
  80. López-Cifuentes, A.; Escudero-Viñolo, M.; Bescós, J.; García-Martín, Á. Semantic-Aware Scene Recognition. Pattern Recognit. 2020, 102, 107256. [Google Scholar] [CrossRef]
  81. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  82. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  83. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  84. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.; Lee, J.; et al. MediaPipe: A Framework for Perceiving and Processing Reality. In Proceedings of the Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, CA, USA, 17 June 2019. [Google Scholar]
  85. Lima, J.P.; Roberto, R.; Figueiredo, L.; Simões, F.; Thomas, D.; Uchiyama, H.; Teichrieb, V. 3D pedestrian localization using multiple cameras: A generalizable approach. Mach. Vis. Appl. 2022, 33, 61. [Google Scholar] [CrossRef]
  86. Limbu, D.K.; Anthony, W.C.Y.; Adrian, T.H.J.; Dung, T.A.; Kee, T.Y.; Dat, T.H.; Alvin, W.H.Y.; Terence, N.W.Z.; Ridong, J.; Jun, L. Affective social interaction with CuDDler robot. In Proceedings of the 2013 6th IEEE Conference on Robotics, Automation and Mechatronics (RAM), Manila, Philippines, 12–15 November 2013; pp. 179–184. [Google Scholar]
  87. Busch, A.B.; Sugarman, D.E.; Horvitz, L.E.; Greenfield, S.F. Telemedicine for treating mental health and substance use disorders: Reflections since the pandemic. Neuropsychopharmacology 2021, 46, 1068–1070. [Google Scholar] [CrossRef]
  88. Zoph, B.; Ghiasi, G.; Lin, T.Y.; Cui, Y.; Liu, H.; Cubuk, E.D.; Le, Q. Rethinking pre-training and self-training. Adv. Neural Inf. Process. Syst. 2020, 33, 3833–3845. [Google Scholar]
  89. Li, D.; Zhang, H. Improved regularization and robustness for fine-tuning in neural networks. Adv. Neural Inf. Process. Syst. 2021, 34, 27249–27262. [Google Scholar]
  90. Chen, X.; Wang, S.; Fu, B.; Long, M.; Wang, J. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  91. Xu, Y.; Zhong, X.; Yepes, A.J.J.; Lau, J.H. Forget me not: Reducing catastrophic forgetting for domain adaptation in reading comprehension. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  92. Hastings, J.; Ceusters, W.; Smith, B.; Mulligan, K. Dispositions and processes in the Emotion Ontology. In Proceedings of the 2nd International Conference on Biomedical Ontology, Buffalo, NY, USA, 26–30 July 2011. [Google Scholar]
Figure 1. Datasets currently used for the task of emotion recognition. Samples were extracted directly from the datasets, except for (f), which was extracted from the corresponding arXiv manuscript (CC-BY-SA license), since the download link for that dataset is currently offline. Although some datasets explicitly show faces, such as (a,c), others, such as (b,d), have samples with severe occlusion, given their focus on other nonverbal cues.
Figure 2. Examples from the CAER-S dataset with more than one person in the image: Angry (a); Disgust (b); Happy (c); Sad (d).
Figure 3. An overview of the annotations on EMOTIC’s test set and the concordance of the annotators. Each cell contains the co-occurrence between the pair of emotions.
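For readers reproducing Figure 3, a co-occurrence matrix of this kind can be computed from a binary label matrix; the sketch below uses synthetic annotations purely for illustration.

```python
# Sketch: pairwise co-occurrence counts from multi-label annotations
# (synthetic binary matrix: one row per annotated person, one column per category).
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.random((1000, 26)) < 0.15).astype(int)

co_occurrence = labels.T @ labels        # [i, j] = samples annotated with both i and j
np.fill_diagonal(co_occurrence, 0)       # keep only cross-category counts
print(co_occurrence.shape)               # (26, 26)
```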
Figure 4. Visualization of the continuous annotations for EMOTIC. An interactive visualization is available in the GitHub repository associated with this publication.
Figure 5. Examples from the EMOTIC dataset [16,53] with severe facial occlusion. For this dataset, it is expected that techniques can extract information from context; therefore, this is not an issue but rather a characteristic. Each column contains samples from a subset. The images were padded to allow better visualization: emodb (a); ade20k (b); framesdb (c); mscoco (d).
Figure 6. A bar chart showing how the images from CAER-S (red) and EMOTIC (blue) are classified by the scene recognition approach proposed by [80]. The difference in the height of each bar indicates the difference in the number of samples of each dataset. A dataset can be considered balanced if the samples are distributed equally among the classes; this behavior is more evident for EMOTIC than for CAER-S.
Figure 7. Visibility of each individual body joint as keypoints for the datasets evaluated in this experiment: EMOTIC (a); CAER-S (b); BoLD (c). From top to bottom, we have the face, arms and hands, and legs and feet, following the 33 pose landmarks defined by MediaPipe (available at https://github.com/google/mediapipe/blob/master/docs/solutions/pose.md, accessed on 4 April 2023). The size and heatmap color of each sampled point represent the visibility of the keypoint.
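As a pointer for reproducing the keypoint-visibility analysis in Figure 7, the sketch below extracts per-landmark visibility scores for a single image with MediaPipe Pose; the image path is hypothetical and the aggregation over whole datasets is omitted.

```python
# Sketch: per-landmark visibility for one image with MediaPipe Pose
# (hypothetical image path; dataset-level aggregation omitted).
import cv2
import mediapipe as mp

image = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2RGB)

with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(image)

if results.pose_landmarks:
    for idx, landmark in enumerate(results.pose_landmarks.landmark):
        # visibility in [0, 1]: likelihood that the landmark is visible in the frame
        print(idx, round(landmark.visibility, 3))
```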
Table 1. Body movements observed with emotions [30,32].
Emotion | Description of Body Movements
Joy | Various purposeless movements, jumping, dancing for joy, clapping of hands, stamping while laughing, head nods to and fro, during excessive laughter the whole body is thrown backward and shakes or is almost convulsed, body held erect and head upright (pp. 76, 196, 197, 200, 206, 210, 214)
Sadness | Motionless, passive, head hangs on contracted chest (p. 176)
Pride | Head and body held erect (p. 263)
Shame | Turning away the whole body, more especially the face, avert, bend down, awkward, nervous movements (pp. 320, 328, 329)
Fear | Head sinks between shoulders, motionless or crouches down (pp. 280, 290), convulsive movements, hands alternately clenched and opened with twitching movement, arms thrown wildly over the head, the whole body often turned away or shrinks, arms violently protruded as if to push away, raising both shoulders with the bent arms pressed closely against sides or chest (pp. 291, 305)
Anger/rage | Whole body trembles, intend to push or strike violently away, inanimate objects struck or dashed to the ground, gestures become purposeless or frantic, pacing up and down, shaking fist, head erect, chest well expanded, feet planted firmly on the ground, one or both elbows squared or arms rigidly suspended by the sides, fists are clenched, shoulders squared (pp. 74, 239, 243, 245, 271, 361)
Disgust | Gestures as if to push away or to guard oneself, spitting, arms pressed close to the sides, shoulders raised as when horror is experienced (pp. 257, 260)
Contempt | Turning away of the whole body, snapping one’s fingers (pp. 254, 255, 256)
Table 3. State-of-the-art techniques, published from May 2019 until January 2023, and the datasets used for their evaluation. Techniques that appear twice, such as [18], were evaluated on both listed datasets.
Approach | Available From | Dataset
Wu et al. [62] | May 2019 | EMOTIC
Zhang et al. [63] | August 2019 | EMOTIC
Mittal et al. [56] | June 2020 | EMOTIC
Thuseethan et al. [64] | September 2021 | EMOTIC
Peng et al. [65] | December 2021 | EMOTIC
Wu et al. [66] | August 2022 | EMOTIC
Yang et al. [67] | October 2022 | EMOTIC
Lee et al. [17] | October 2019 | CAER
Lee et al. [17] | October 2019 | CAER-S
Gao et al. [68] | January 2021 | CAER-S
Zhao et al. [69] | May 2021 | CAER-S
Zhao et al. [70] | July 2021 | CAER-S
Le et al. [18] | December 2021 | CAER-S
Le et al. [18] | December 2021 | NCAER-S 1
Wu et al. [66] | August 2022 | CAER-S
Chen et al. [20] | October 2022 | CAER-S
Costa et al. [19] | October 2022 | CAER-S
Zhou et al. [71] | January 2023 | CAER-S
1 A variation of CAER-S proposed by these authors.
Table 4. Presence and visibility of nonverbal cues in the datasets currently employed by state-of-the-art techniques. For each dataset (rows), we classify each cue (columns) as missing, somewhat present, present, or annotated (when the cue is not only visible but also annotated by humans).
Dataset Name | Facial Expression | Context or Background | Body Language | Others
EMOTIC [16,53] | Somewhat present | Present | Somewhat present | N/A
CAER/CAER-S [17] | Present | Present | Somewhat present | N/A
iMiGUE [54] | Present | Missing | Annotated | Annotated microgestures
AfeW/AfeW-VA [51,52] | Present | Somewhat present | Somewhat present | N/A
BoLD [55] | Present | Present | Annotated | N/A
GroupWalk [56] | Somewhat present | Somewhat present | Somewhat present | Gait is visible but not annotated
Table 5. Summary of the suggested application scenarios of each dataset and their possible limitations.
EMOTIC [16,53]
Suggested application scenarios: Images in which context is highly representative and/or highly variable.
  • Content published on social networks.
  • Scenes in which the face can be occluded.
  • Images sampled from cameras aligned with the ground.
  • Applications that require a deeper comprehension of sentiments instead of a direct emotion classification.
Possible limitations:
  • Lack of neutral classification may generate noise in some applications.
  • Lack of ethnic representation means that techniques could be biased in some regions.
CAER/CAER-S [17]
Suggested application scenarios: Images or videos in which the face will mostly be aligned to the camera.
  • Interviews, one-on-one meetings, reactions, or other scenarios where the camera will be mounted at eye level.
  • Feedback for retailers on individual spaces or services.
  • Generic applications in which a direct classification is sufficient.
  • Affective computing on notebooks, smartphones, tablets, or other mobile devices with a front-facing camera.
Possible limitations:
  • Prejudicial when the camera has a different viewpoint than the one present in the dataset (for example, top-down).
  • Prejudicial when the person is not directly aligned to the camera.
  • Lack of ethnic representation means that techniques could be biased in some regions.
iMiGUE [54]
Suggested application scenarios: Spontaneous scenarios in which subtle cues need to be considered.
  • Interviews, one-on-one meetings, reactions, or other scenarios where the camera will be mounted at eye level.
  • Applications with high reliance on body language and gestures.
Possible limitations:
  • By itself, the dataset can be harmful for scenarios in which facial expressions are considered important.
  • Prejudicial when the person is not directly aligned to the camera.
  • No direct emotion representation is present in the dataset.
AfeW/AfeW-VA [51,52]
Suggested application scenarios: Scenes with low context contribution.
  • Affective computing on notebooks, smartphones, tablets, or other mobile devices with a front-facing camera.
  • Feedback for retailers on individual spaces or services.
  • Scenes in which subtle facial expressions do not need to be considered.
Possible limitations:
  • Lack of ethnic representation means that techniques could be biased in some regions.
  • Low image resolution might restrict models in learning subtle cues.
BoLD [55]
Suggested application scenarios: Images or videos with representative body language.
  • Applications with a deeper emotion representation related to multiple cues simultaneously.
  • Social interactions.
  • Applications that are expected to work with people from diverse ethnicities.
  • Images sampled from cameras aligned with the ground.
  • Scenes in which the face can be occluded.
Possible limitations:
  • Low resolution in some samples could impact learning from facial expressions.
  • Lack of neutral classification may generate noise in some applications.
GroupWalk [56]
Suggested application scenarios: Scenarios in which a distant emotional overview is sufficient.
  • Security cameras.
  • High-level descriptions of emotions in a (smart) space.
  • Interactions in which more descriptive emotions are not needed.
Possible limitations:
  • Noise from the dataset could be prejudicial to the model.
  • Gait is not always representative of emotion in this viewpoint.
  • Poor presence of other nonverbal cues.
  • Generic emotion categories.
Table 6. Emotion categories employed in the datasets reviewed in this work.
EMOTIC and BoLD:
  1. Peace: well being and relaxed; no worry; having positive thoughts or sensations; satisfied
  2. Affection: fond feelings; love; tenderness
  3. Esteem: feelings of favorable opinion or judgment; respect; admiration; gratefulness
  4. Anticipation: state of looking forward; hoping for or getting prepared for possible future events
  5. Engagement: paying attention to something; absorbed into something; curious; interested
  6. Confidence: feeling of being certain; conviction that an outcome will be favorable; encouraged; proud
  7. Happiness: feeling delighted; feeling enjoyment or amusement
  8. Pleasure: feeling of delight in the senses
  9. Excitement: feeling enthusiasm; stimulated; energetic
  10. Surprise: sudden discovery of something unexpected
  11. Sympathy: state of sharing others emotions, goals or troubles; supportive; compassionate
  12. Doubt/Confusion: difficulty to understand or decide; thinking about different options
  13. Disconnection: feeling not interested in the main event of the surroundings; indifferent; bored; distracted
  14. Fatigue: weariness; tiredness; sleepy
  15. Embarrassment: feeling ashamed or guilty
  16. Yearning: strong desire to have something; jealous; envious; lust
  17. Disapproval: feeling that something is wrong or reprehensible; contempt; hostile
  18. Aversion: feeling disgust, dislike, repulsion; feeling hate
  19. Annoyance: bothered by something or someone; irritated; impatient; frustrated
  20. Anger: intense displeasure or rage; furious; resentful
  21. Sensitivity: feeling of being physically or emotionally wounded; feeling delicate or vulnerable
  22. Sadness: feeling unhappy, sorrow, disappointed, or discouraged
  23. Disquietment: nervous; worried; upset; anxious; tense; pressured; alarmed
  24. Fear: feeling suspicious or afraid of danger, threat, evil or pain; horror
  25. Pain: physical suffering
  26. Suffering: psychological or emotional pain; distressed; anguished
CAER and AfeW:
  1. Anger; 2. Disgust; 3. Fear; 4. Happiness (Happy); 5. Sadness (Sad); 6. Surprise; 7. Neutral
iMiGUE:
  1. Positive; 2. Negative
GroupWalk:
  1. Somewhat happy; 2. Extremely happy; 3. Somewhat sad; 4. Extremely sad; 5. Somewhat angry; 6. Extremely angry; 7. Neutral
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
