Furthermore, datasets are usually accompanied by annotations that are essential for training, testing and validating methods for facial expression recognition. These annotations are particularly relevant for videos where, depending on whether the annotations are given at frame or video level, analyses at different granularities can be performed. This has a considerable impact depending on whether the datasets include posed, spontaneous or in-the-wild capturing, and on the expression model, either based on the six basic expressions or on the circumplex model. In fact, while providing the six expression labels for posed and spontaneous datasets is a relatively easy task, more difficulties are experienced when the circumplex model is adopted. For in-the-wild capturing, ground-truth annotations are provided offline and require experienced annotators. This entails a lot of work from human annotators, which is costly and time-consuming. Sometimes, this human effort is alleviated by resorting to some form of Mechanical Turk that distributes the load to low-experience and low-cost workers. However, being performed by non-expert personnel, the resulting annotations can show diminished accuracy, originating from the averaging of annotations across several crowd workers.
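To make the aggregation step concrete, the following is a minimal sketch (not taken from any of the surveyed datasets) of how per-image labels from several crowd workers can be combined: a majority vote for categorical expression labels and a simple mean for continuous valence/arousal ratings. Function and variable names are illustrative assumptions.

```python
from collections import Counter
from statistics import mean

def aggregate_categorical(labels):
    """Majority vote over categorical expression labels from several workers."""
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    # The agreement ratio can be kept as a rough confidence estimate.
    agreement = counts[label] / len(labels)
    return label, agreement

def aggregate_dimensional(ratings):
    """Average continuous (valence, arousal) ratings from several workers."""
    valence = mean(r[0] for r in ratings)
    arousal = mean(r[1] for r in ratings)
    return valence, arousal

# Example: three hypothetical workers annotating the same face image.
print(aggregate_categorical(["happy", "happy", "surprise"]))        # ('happy', 0.66...)
print(aggregate_dimensional([(0.8, 0.4), (0.6, 0.5), (0.7, 0.3)]))  # (0.7, 0.4)
```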
2.1. Spontaneous Datasets
In this section, we focus on spontaneous macro-expression datasets. Some samples of these expressions are shown in
Figure 2. These datasets are listed in
Section 3.4.
EB+ (An expanded version of BP4D+): The EB+ [
24] dataset is an expanded version of BP4D+ [
25]. It contains videos from a total of 200 subjects: 140 subjects from BP4D+, plus 60 additional subjects who performed five to eight tasks designed to induce varied emotions while interacting with an experimenter. A team of certified FACS coders annotated the dataset manually.
BP4D+ (Multimodal Spontaneous Emotion): The tasks used in EB+ are explained in detail in the BP4D+ or Multimodal Spontaneous Emotion (MMSE) dataset. This dataset was collected for human behavior analysis and includes 140 participants from different ethnic origins. The collected data include thermal (infrared) sensing, high-resolution 2D videos, high-resolution 3D dynamic imaging and contact physiological sensors measuring respiration, heart rate, electrical conductivity of the skin and blood pressure. BP4D+ (see
Figure 3) presents ten different emotion categories (happiness or amusement, surprise, sadness, startle or surprise, skeptical, embarrassment, fear or nervous, physical pain, angry and disgust) recorded per person according to the ten tasks that each person experienced. More specifically, these tasks include: listen to a funny joke, watch a 3D avatar of the participant, listen to 911 emergency phone calls, experience a sudden burst of sound, respond to a true or false question, improvise a silly song, play a dart game, submerge the hands into ice water, receive a complaint about a poor performance and smell an unpleasant odor. BP4D+ has a larger scale and variability of images than BP4D Spontaneous [
14]. Since its creation, BP4D+ has been widely used.
BP4D (Binghamton-Pittsburgh 3D DynAMIc Spontaneous Facial Expression Data-base): BP4D Spontaneous [
14] contains 41 participants from four different ethnic origins (Asian, African-American, Hispanic, and Euro-American). It presents eight emotions (happiness or amusement, sadness, surprise or startle, embarrassment, fear or nervous, pain, anger or upset and disgust) derived through a combination of interviews, planned activities, film watching, a cold pressor test, social challenge and olfactory stimulation. The facial expressions in the dataset were annotated using the Facial Action Coding System (FACS).
iSAFE (Indian Semi-Acted Facial Expression Database): iSAFE [
9] contains 44 volunteers of Indo-Aryan and Dravidian (Asian) origin, 395 clips and seven emotions (happy, sad, fear, surprise, angry, neutral, disgust) captured with a camera placed behind a laptop, while the volunteers were asked to watch a few stimulant videos. The facial expressions were self-annotated by the volunteers through a user-interface portal and cross-annotated by an annotator.
TAVER (Tri-modal Arousal-Valence Emotion Recognition database): TAVER [
26] contains 17 subjects from one ethnic origin (Korean). It was collected to support the estimation of dimensional emotion states from color, depth, and thermal videos recorded during human–human interaction. The emotion (arousal–valence) was elicited by embarrassing and stressing participants through questions asked in a language (English) different from their own (Korean). Participants self-reported feeling uncomfortable during the interviews in another language. Six human operators annotated the video sequences, with three annotators per sequence for higher accuracy.
RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song): The RAVDESS [
27] dataset contains 24 participants from different ethnic groups (Caucasian, East-Asian, and Mixed (East-Asian Caucasian, and Black-Canadian First Nations Caucasian)). Emotional elicitation in the RAVDESS dataset is achieved through the actual performance of emotions by actors, who were told to induce the desired state and provide genuine expressions of emotion. This dataset is particularly suited to machine learning approaches involving supervised learning.
GFT (Group Formation Task): GFT [
28] contains 96 participants and 172,800 frames from a larger study on the impact of alcohol on group formation processes. In this study, participants affirmed that they could comfortably drink at least three drinks in 30 min. They were seated around a circular table in an observation room where they were asked to consume a beverage and to discuss any topics except their level of intoxication.
SEWA (Automatic Sentiment Analysis in the wild): SEWA [
29] contains 398 participants of different nationalities (British, German, Hungarian, Greek, Serbian, and Chinese), and 1990 audio-visual recording clips were collected during the experiment, comprising 1600 min of audio-visual data of people’s reactions to adverts and 1057 min of video-chat recordings. To stimulate the emotions, the participants were asked to watch four advertisements, each around 60 s long. These adverts were chosen to elicit mental states including amusement, empathy, liking and boredom. In a second part, the participants were divided into pairs based on their cultural background, age and gender (for natural interaction, each pair was required to know each other personally in advance). After watching the fourth advertisement, the two participants were asked to discuss, for three minutes on average, the advertisement they had just watched. The subtle changes in the participants’ emotional state (valence, arousal, and liking/disliking) were annotated by human operators from the same cultural background as the recorded subjects (five for each). The SEWA dataset contains annotations for facial landmarks, acoustic low-level descriptors, hand gestures, head gestures, facial action units, verbal and vocal cues, continuously valued valence, arousal and liking/disliking, template behaviors, episodes of agreement/disagreement and mimicry episodes.
BioVid Emo (psychophysiological signals with video signals for discrete basic emotions): The BioVid Emo [
30] dataset combines psycho-physiological signals with video signals for discrete basic emotions that were effectively elicited by film clips from 86 participants. The psycho-physiological signals that have been considered in this study are: skin conductance level, electrocardiogram, trapezius electromyogram and four video signals. Five discrete emotions (amusement, sadness, anger, disgust and fear) were elicited by 15 standardized film clips.
ISED (Indian Spontaneous Expression Database): ISED [
31] contains 50 Indian subjects and 428 videos. Emotions were induced among the participants by using emotional videos and simultaneously their self-ratings were collected for each experienced emotion (sadness, surprise, happiness, and disgust).
4D CCDb (4D Cardiff Conversation Database): 4D CCDb [
32] contains four participants recording 17 conversations, which have been fully annotated for speaker and listener activity: conversational facial expressions, head motion, and verbal/non-verbal utterances. The annotation tracks included were: front channel, backchannel, agree, disagree, utterance (verbal/non-verbal), happy (smile or laugh), interesting-backchannel, surprise-positive, surprise-negative, thinking, confusion, head nodding, head shake, head tilt and other.
MAHNOB Mimicry (The mahnob mimicry database: A database of naturalistic human interactions): MAHNOB Mimicry [
33] contains 60 subjects from staff and students at Imperial College London (Europe or the Near-East). The subjects were recorded over 54 sessions of dyadic interactions between 12 confederates and their 48 counterparts, being engaged either in a socio-political discussion or negotiating a tenancy agreement.
OPEN-EmoRec-II (A Multimodal Corpus of Human-Computer Interaction): OPEN-EmoRec-II [
34] has been designed to induce emotional responses in HCI users during two different parts of an HCI experiment. It contains 30 subjects involving video, audio, physiology (SCL, respiration, BVP, EMG Corrugator supercilii, EMG Zygomaticus Major) and facial reaction annotations.
AVEC’14 (Audio-Visual Emotion recognition Challenge (AVEC 2014)): AVEC’14 [
35] contains 84 German subjects with 300 audio-visuals. The challenge has two goals logically organized as sub-challenges: to predict the continuous values of the affective dimensions valence, arousal and dominance at each moment in time; and to predict the value of a single self-reported severity of depression indicator for each recording in the dataset.
DISFA (A Spontaneous Facial Action Intensity Database): DISFA [
36] contains 27 subjects from different ethnic origins (Asian, Euro American, Hispanic, and African American) and 130,000 annotations. Participants viewed a four-minute video clip intended to elicit spontaneous Action Units (AUs) corresponding to a range of facial expressions of emotion.
RECOLA (REmote COLlaborative and Affective interactions = Multimodal Corpus of Remote Collaborative and Affective Interactions (in French: RECOLA)): RECOLA [
37] contains 46 subjects of different nationalities (French, Italian, German and Portuguese). It is based on a study focusing on emotion perception during remote collaboration, where participants were asked to perform individual and group tasks.
AVEC’13 (Audio-Visual Emotion recognition Challenge (AVEC 2013)): AVEC’13 [
38] contains 292 German subjects and 340 audio-visuals. Subjects performed a human-computer interaction task, while being recorded by a webcam and a microphone.
CCDb (Cardiff conversation database): The CCDb [
39] 2D audiovisual dataset contains natural conversations between pairs of people. All 16 participants were fully fluent in the English language. It includes 30 audio-visuals.
DynEmo (Dynamic and spontaneous emotional facial expression database): The DynEmo [
40] dataset contains 358 Caucasian participants filmed in natural but standardized conditions. The participants were enrolled in ten tasks designed to elicit a subjective affective state, rated both by the expresser (self-reported after the emotion-inducing tasks, using dimensional, action-readiness and emotional-label items) and by the observers (continuous annotations).
DEAP (A Database for Emotion Analysis Using Physiological Signals): DEAP [
41] contains 32 mostly European students and 40 videos. Participants watched music videos and rated them on a discrete nine-point scale for valence, arousal and dominance.
SEMAINE: SEMAINE [
42] contains 24 undergraduate and postgraduate students between 22 and 60 years old. It consists of 130,695 frames recorded in sessions of typical duration for Solid SAL (Sensitive Artificial Listener) and semi-automatic SAL. In these sessions, participants were asked to change character when they got bored, annoyed or felt they had nothing more to say to the character.
MAHNOB-HCI (multimodal database for affect recognition and implicit tagging): MAHNOB-HCI [
43] includes 27 participants from different educational backgrounds, from undergraduate students to postdoctoral fellows, with English proficiency ranging from intermediate to native. Participants were shown fragments of movies and pictures while being monitored with six video cameras, a head-worn microphone, an eye gaze tracker, as well as physiological sensors measuring ECG, EEG (32 channels), respiration amplitude, and skin temperature.
UNBC-McMaster (McMaster University and University of Northern British Columbia (UNBC)–Painful data: The UNBC-McMaster shoulder pain expression archive database): The UNBC-McMaster (UNBC Shoulder Pain Archive (SP)) [
44] dataset contains 25 participants who self-identified as having a problem with shoulder pain. It contains spontaneous facial expressions related to genuine physical pain, comprising 48,398 frames across 200 video sequences.
CAM3D (3D corpus of spontaneous complex mental states): CAM3D [
45] (
Figure 2) contains 16 participants from different ethnic backgrounds (Caucasian, Asian and Middle Eastern). It involves 108 videos, in which hand-over-face gestures are used as novel affect cues for automatic inference of cognitive mental states.
B3D(AC) (A 3-D Audio-Visual Corpus of Affective Communication): The B3D(AC) [
46] audio-visual corpus dataset contains 14 native English-speaking participants and 1109 sequences. The annotation of the speech signal includes: transcription of the corpus text into a phonological representation, accurate phone segmentation, fundamental frequency extraction, and signal intensity estimation of the speech signals.
CK+ (Extended Cohn-Kanade Dataset): CK+ [
47] contains 593 sequences, in which the 123 participants performed a series of 23 facial displays. It involves seven emotion categories.
AvID (Audiovisual speaker identification and emotion detection for secure communications): AvID [
48] contains 15 subjects, recorded while they describe neutral photographs, play a game of Tetris, describe the game of Tetris and solve cognitive tasks. A one-hour video is captured for each subject, discriminating four emotion classes (neutral, relaxed, moderately aroused and highly aroused).
AVIC (Audiovisual Interest Corpus): AVIC [
49] contains 21 participants from Asian and European ethnic groups, involving 324 episodes that consist of spontaneous as well as conversational speech, demonstrating the “theoretical” effectiveness of the approach.
DD (Detecting depression from facial actions and vocal prosody): The DD dataset [
50] includes 57 participants from a clinical trial for the treatment of depression. Trials were conducted using the Hamilton Rating Scale for Depression (HRS-D), which is a criterion measure for assessing the severity of depression. Participant facial behavior was recorded in response to the first three of the 17 questions of the HRS-D, which concern core features of depression: depressed mood, guilt, and suicidal thoughts.
SAL (The Sensitive Artificial Listener): The SAL [
51] dataset is based on the observation that it is possible for two people to have a conversation in which one pays little or no attention to the meaning of what the other says and chooses responses on the basis of superficial cues. SAL provides a context in which sustained emotionally colored human–machine interaction seems to be achievable. It identifies the emotional state of each of the four users during sessions of 30 min per user, using evidence from faces, upper body, voice, and key words. The range of emotions is wide, but they are not very intense.
HUMAINE (The HUMAINE Database: Addressing the Collection and Annotation of Naturalistic and Induced Emotional Data): HUMAINE [
52] contains 50 clips selected to cover material showing emotion in action and interaction spanning a broad emotional space (positive and negative, active and passive), selected from the following corpora: the Belfast Naturalistic dataset (in English, naturalistic, ten clips), the Castaway Reality Television dataset (in English, naturalistic, ten clips), Sensitive Artificial Listener (in English, induced, 12 clips), Sensitive Artificial Listener (in Hebrew, induced, one clip), Activity/Spaghetti dataset (in English, induced, seven clips), Green Persuasive dataset (in English, induced, four clips), EmoTABOO (in French, induced, two clips), DRIVAWORK corpus (in German, induced, one clip), and GEMEP corpus (in French, acted, one clip).
EmoTABOO (Collection and Annotation of a Corpus of Human-Human Multimodal Interactions: Emotion and Others Anthropomorphic Characteristics: consisting in letting pairs of people play the game “Taboo”): EmoTABOO [
53] is a French dataset containing ten audiovisual clips collected during game playing. People played Taboo, a game in which one person has to explain a ‘taboo’ concept or word to the other using gestures and body movement. It involves multimodal interactions between two people and provides emotional content, with a range of emotions including embarrassment, amusement, etc.
ENTERFACE: ENTERFACE [
54] includes acquisitions for three multimodal emotion detection modalities: the first modality is given by brain signals via fNIRS and contains 16 participants; the second modality includes face videos of five participants; and the third modality captures the scalp EEG signals of 16 participants. EEG and fNIRS provided an “internal” look at the emotion generation processes, while video sequences gave an “external” look on the “same” phenomenon.
UT-Dallas (University of Texas at Dallas): UT-Dallas [
55] contains 1540 video clips of 284 people of Caucasian descent walking and conversing. During filming, the subject watched a ten-minute video, which contained scenes from various movies and television programs intended to elicit different emotions in order to capture emotions such as happiness, sadness and disgust.
RU-FACS (Rochester/UCSD Facial Action Coding System): RU-FACS [
56] contains 100 subjects who attempted to convince an interviewer that they were telling the truth. Interviewers were current and former members of the police and FBI.
MIT (The MIT Media Laboratory, Cambridge MA, USA): MIT [
57] contains over 25,000 scored frames of 17 drivers who gave their consent to having video and physiological signals recorded during the drive.
UA-UIUC (University of Illinois at Urbana-Champaign): UA-UIUC [
58] contains 28 subjects and one video clip for each subject. First, the subjects did not know that they were being tested for their emotional state. Second, subjects were interviewed after the test to find out their true emotional state for each expression.
AAI (Adult Attachment Interview): The AAI [
59] dataset contains 60 subjects from different ethnic groups (European American and Chinese American). The subjects were interviewed and asked to describe their childhood experiences. It contains one audiovisual recording for each subject.
Smile dataset (Dynamics Of Facial Expression: Normative Characteristics And Individual Differences): The Smile dataset [
60] contains 195 spontaneous smiles of 95 subjects. Videos were collected throughout a session that included baselines (seated with eyes open) and viewing of film clips.
Overall, the investigated datasets including spontaneous macro-expressions are the majority, with 39 instances. The number of subjects included in such datasets ranges from less than 50 to more than 500. The number of subjects is not related to other features, like age range, ethnic diversity or even the amount of data. For instance, the TAVER dataset includes 17 subjects, with an age range between 21 and 38 years and only one ethnicity (Korean); the DISFA dataset comprises 27 subjects with an age ranging between 18 and 50 years and four ethnicities (Asian, Euro American, Hispanic, and African American). A large number of subjects does not necessarily correspond to more diversity. For example, the DynEmo dataset with 358 subjects covers an age range between 25 and 65 years, and only one ethnicity (Caucasian). That being said, the SEWA dataset, with 398 subjects, covers an age range between 18 and 65 years and six ethnicities (British, German, Hungarian, Greek, Serbian, and Chinese), and it contains annotations for facial landmarks, acoustic low-level descriptors, hand gestures, head gestures, facial action units, verbal and vocal cues, continuously valued valence, arousal and liking/disliking (toward an advertisement), template behaviors, episodes of agreement/disagreement and mimicry episodes. Finally, each dataset handles a different set of emotions: the six basic emotions and neutral (iSAFE), the six basic emotions plus embarrassment and pain (BP4D-Spontaneous), four emotions (ISED) or even one emotion (Smile dataset). Some other datasets represent emotions in the form of valence and arousal (DEAP, AVEC’14).
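As an illustration of the difference between the two annotation schemes just mentioned, the sketch below places discrete expression labels at rough valence–arousal coordinates in the circumplex model; the numeric coordinates are illustrative assumptions, not values taken from any of the surveyed datasets.

```python
# Illustrative mapping from discrete expression labels to approximate
# (valence, arousal) coordinates in [-1, 1]; the exact positions are
# assumptions chosen only to show the idea behind the circumplex model.
CIRCUMPLEX = {
    "happy":    ( 0.8,  0.5),
    "surprise": ( 0.2,  0.8),
    "fear":     (-0.6,  0.7),
    "angry":    (-0.7,  0.6),
    "disgust":  (-0.7,  0.2),
    "sad":      (-0.7, -0.4),
    "neutral":  ( 0.0,  0.0),
}

def to_valence_arousal(label):
    """Convert a categorical label to an approximate circumplex coordinate."""
    return CIRCUMPLEX[label.lower()]

print(to_valence_arousal("Happy"))  # (0.8, 0.5)
```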
2.2. Spontaneous and Posed Datasets
We consider herein datasets that contain both spontaneous and posed recordings, since we are interested in their spontaneous part.
4DFAB (4D Facial Expression Database for Biometric Applications): The 4DFAB [
61] dataset includes the six posed and spontaneous expressions (anger, disgust, fear, happiness, sadness and surprise) and nine word utterances (puppy, baby, mushroom, password, ice cream, bubble, Cardiff, bob, rope). It contains recordings of 180 subjects captured in four different sessions spanning a five-year period. This dataset encloses 4D videos of subjects displaying both spontaneous and posed facial behaviors.
BAUM-1 (Bahcesehir University Multimodal Affective Database-1): BAUM-1 [
16] contains 31 Turkish subjects and 1,184 multimodal facial video clips. The expressed emotional and mental states consist of happiness, anger, sadness, disgust, fear, surprise, boredom, contempt, confusion, neutral, thinking, concentrating, and bothered.
MAHNOB Laughter (The MAHNOB Laughter database): MAHNOB Laughter [
62] contains 22 subjects from 12 different countries and of different origins recorded in 180 sessions. In particular, there are 563 laughter episodes, 849 speech utterances, 51 posed laughs, 67 speech–laugh episodes and 167 other vocalizations annotated in the dataset.
PICS-Stirling ESRC 3D Face (Psychological Image Collection at Stirling-ESRC project 3D Face Database): PICS-Stirling ESRC 3D Face [
63] contains 99 subjects, a number of 2D images, video sequences as well as 3D face scans. Seven different expression variations were captured.
Hi4D-ADSIP (High Resolution 4 Dimensional Database from the Applied Digital Signal and Image Processing Research Centre): Hi4D-ADSIP [
12] contains 80 subjects, including undergraduate students from the Performing Arts Department at the University as well as undergraduate students, postgraduate students and members of staff from other departments. It involves 3360 images/sequences and consists of seven basic facial expressions and seven further facial articulations.
USTC-NVIE (University of Science and Technology of China (USTC)-Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference): USTC-NVIE [
64] contains 215 subjects, 236 images and six basic expressions. Two kinds of facial expressions were recorded: spontaneous expressions induced by film clips and posed ones obtained by asking the subjects to perform a series of expressions in front of the cameras.
MMI-V (Induced Disgust, Happiness and Surprise: an Addition to the MMI Facial Expression Database): MMI-V [
65] contains 25 subjects from different ethnic groups (European, South American, and Asian), with one hour and 32 min of recorded data and 392 segments. Part IV of the dataset was annotated for the six basic emotions and facial muscle actions. Part V of the dataset was annotated for voiced and unvoiced laughter. Parts IV and V exist because the MMI-V dataset was added to the MMI [
66] facial expression dataset.
MMI (The acronym MMI comes from M&M Initiative, where the Ms are the initials of the two main authors. Although other colleagues joined the development efforts of the main authors, the acronym remained in use): The MMI dataset contains 19 subjects from different ethnic groups (European, Asian, or South American), 740 static images of frontal and side views and 848 videos.
AVLC (The AVLaughterCycle Database): AVLC [
67] contains 24 subjects of different nationalities and ethnic groups (Belgium, France, Italy, UK, Greece, Turkey, Kazakhstan, India, Canada, USA, and South Korea), with 1000 spontaneous laughs elicited by a funny movie and 27 acted laughs.
IEMOCAP (The Interactive Emotional Dyadic Motion Capture): IEMOCAP [
68] contains 120 actors (fluent English speakers) recorded in 12 h of audiovisual data, including video, speech, motion capture of faces and text transcriptions. The actors performed selected emotional scripts and also improvised spontaneous spoken communication scenarios to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state).
AMI (Augmented Multi-party Interaction): The AMI [
69] dataset contains a multi-modal set of data consisting of 100 h of meeting recordings, where some of them are naturally occurring, and some others are elicited. In this latter case, a particular scenario is used where the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day.
Although we did not discuss posed expressions, we included 11 datasets with both spontaneous and posed macro-expressions in our survey. In this category, the 4DFAB dataset presents an interesting age range, covering subjects from 5 to 75 years. Furthermore, the USTC-NVIE dataset presents the highest number of subjects, with 215 students. Although the MAHNOB Laughter dataset contains substantial ethnic variation (subjects from 12 different countries and of different origins), its average age is between 27 and 28 years.
2.3. In-the-Wild Datasets
In in-the-wild datasets, the expressions resulting from human–human interaction are spontaneous, and both the emotional content and the capture conditions are uncontrolled.
RAF-DB (Real-world Affective Faces Database): RAF-DB [
70] includes thousands of subjects with 30,000 facial images collected from Flickr.
Aff-Wild2 (Extending the Aff-Wild Database for Affect Recognition): The Aff-Wild2 dataset contains videos downloaded from YouTube, featuring 258 subjects ranging from infants and young children to elderly people [
71]. It covers various ethnic groups (Caucasian, Hispanic or Latino, Asian, black, and African American) and different professions (e.g., actors, athletes, politicians, journalists), as well as changes in head pose, illumination conditions, occlusions and emotions.
AM-FED+ (An Extended Dataset Affectiva-MIT Facial Expression Dataset): In the AM-FED+ [
72] dataset, 416 participants from around the world (their locations are not known) were recruited to watch video advertisements. It contains 1044 videos of naturalistic facial responses to online media content recorded over the Internet.
AffectNet (Affect from the InterNet): AffectNet [
73] contains more than 1,000,000 facial images from the Internet of more than 450,000 participants, annotated for valence and arousal as well as eight emotion categories.
AFEW-VA (Database for valence and arousal estimation in-the-wild): The AFEW-VA dataset [
74] (
Figure 4) contains 240 movie actors in an age range between 8 and 76 years and 600 video clips.
Aff-Wild (Affect in-the-Wild Dataset): Within the Aff-Wild dataset [
75], more than 500 videos were collected from YouTube, while capturing subjects displaying a number of spontaneous emotions. The data were tagged using emotion-related keywords such as feeling, anger, hysteria, sorrow, fear, pain, surprise, joy, sadness, disgust, love, wrath, contempt, etc.
EmotioNet (Annotating a million face images in the wild): EmotioNet [
17] contains one million images of facial expressions downloaded from the Internet, categorized within one of the 23 basic and compound emotion categories. The images have been annotated either with emotion category or with corresponding AUs.
FER-Wild (Facial Expression Recognition from World Wild Web): FER-Wild [
15] contains 24,000 Web images from Google, Bing, Yahoo, Baidu and Yandex. These images were categorized into nine categories (no-face, six basic expressions: happy, sad, surprise, fear, disgust, anger, neutral, none, and uncertain). The ’no-face’ category is assigned in the following cases: there is no face in the image, there is a watermark on the face, the bounding box was not on the face or did not cover the majority of the face, the face is a drawing, animation, painted, or printed on something else, or the face is distorted beyond a natural or normal shape. The ’no-face’ label is assigned even if an expression could be inferred. The ’none’ category is assigned when the image does not present the six basic emotions or neutral (such as sleepy, bored, tired, etc.). The ’uncertain’ category is assigned when the annotators are unsure of the facial expression.
Vinereactor (Reactions for vine videos): Vinereactor [
76] contains 222 Mechanical Turk workers filmed with a webcam while watching 200 random Vine videos from the comedy vine.co channel, in order to capture their reactions.
CHEAVD (Chinese Natural Emotional Audio–Visual Database): CHEAVD [
77] is extracted from 34 films, two TV series and four other television shows presenting 26 non-prototypical emotional states, including the six basic ones, from 238 speakers.
HAPPEI (HAPpy PEople Images): HAPPEI [
78] contains 4886 images downloaded from Flickr of 8500 faces, manually annotated by four human labelers. The emotions have been categorized according to group-level happiness intensity (neutral, small smile, large smile, small laugh, large laugh and thrilled).
AM-FED (Affectiva-MIT Facial Expression Dataset): AM-FED [
79] contains 242 facial webcam videos recorded in real-world conditions of viewers, from a range of ages and ethnicities, while watching films. It is labeled frame-by-frame for the presence of ten symmetrical FACS action units, four asymmetric (unilateral) FACS action units, two head movements, smile, general expressiveness, feature tracker failures and gender.
FER-2013 (Facial Expression Recognition 2013 dataset): FER-2013 [
11] contains 35,685 facial expressions from images queried from the web. Images were categorized based on the emotion shown in the facial expressions (happiness, neutral, sadness, anger, surprise, disgust, fear).
SFEW (Static Facial Expressions in the Wild): SFEW [
7] is an extracted dataset (by selecting frames) from the AFEW [
10] dataset.
AFEW (Acted Facial Expressions in the Wild): AFEW contains 330 subjects from fifty-four movie DVDs, 1426 sequences, seven emotions (anger, disgust, fear, sadness, happiness, neutral, and surprise) and 1747 expressions.
Belfast induced (Belfast Natural Induced Emotion Database): The Belfast induced dataset [
80] is divided into three sets of tasks: Set 1 contains 114 subjects, all undergraduate students, and encloses 570 audio-visuals. It was developed as stimuli for research into the individual differences that might influence human abilities to encode and decode emotional signals. Set 2 contains 82 subjects, undergraduate and postgraduate students or employed professionals, and encloses 650 audio-visuals. It was developed to allow comparison of these new tasks with more traditional film elicitors that had previously been validated for their ability to induce discrete emotions. Set 3 contains 60 subjects from different ethnic groups (Peru, Northern Ireland) and encloses 180 audio-visuals. It contains variants of the disgust and fear (both active/social) tasks and the amusement (passive/social) task from Set 1. The emotions were self-reported by the participants.
VAM-faces (“Vera Am Mittag”–German TV talk show): The VAM-faces [
81] dataset consists of 12 h of audio-visual recordings of the German TV talk show “Vera am Mittag”, which were segmented into broadcasts, dialogue acts and utterances. It contains 20 speakers and a set of 1867 images (93.6 images per speaker on average).
FreeTalk (Tools and Resources for Visualising Conversational-Speech Interaction): The FreeTalk [
82] dataset contains four subjects from different countries having a conversation in English. It consists of two 90-minute multiparty conversations, and the naturalness of the dialogues is further indicated by the topics of the conversation.
EmoTV (emotional video corpus: TV interviews (monologue)): The EmoTV [
83] dataset is a corpus of 51 video clips recorded from French TV channels containing interviews. It contains 48 subjects interviewed with a range of 14 emotions classes (anger, despair, doubt, disgust, exaltation, fear, irritation, joy, neutral, pain, sadness, serenity, surprise, and worry).
BAUM-2 (a multilingual audio-visual affective face database): BAUM-2 [
13] contains 286 subjects from 122 movies and TV series, resulting in 1047 video clips in two languages (Turkish, English). It involves eight emotions (anger, happiness, sadness, disgust, surprise, fear, contempt and neutral). The dataset also provides a set of annotations such as subject age, approximate head pose, emotion labels and intensity scores of emotions.
Overall, the twenty investigated in-the-wild macro-expression datasets have the highest numbers of subjects, reaching thousands of subjects in the RAF-DB dataset and participants from around the world in the AM-FED+ dataset, as well as the highest diversity of emotions, with 23 categories of emotions in EmotioNet.
2.4. Other Categorizations of Macro-Expression Datasets
In the following, we propose other categorizations for the spontaneous and in-the-wild datasets. One way is to consider the different ways in which the data have been collected:
- In spontaneous datasets, unlike posed datasets, where participants are asked to perform an emotion, subjects’ emotions are stimulated. For example, in [
9], face expressions were captured when volunteers were asked to watch a few stimulant videos. In a similar way, in [
43], participants were shown fragments of movies and pictures. In [
31], emotional videos were used for each emotion, and in the dataset investigated in [
14], a combination of interviews, planned activities, film watching, a cold pressor test, social challenge and olfactory stimulation was explored. In [
42], participants were told to change character when they got bored, annoyed or felt they had nothing more to say to the character. The dataset proposed in [
49] collected conversational speech, and the work in [
51] had been based on a conversation between two people in which one pays little or no attention to the meaning of what the other says and chooses responses on the basis of superficial cues. In [
50], participants were from a clinical trial for the treatment of depression, whereas in [
27], the participant has a dialogue script with vignettes for each emotional category. In [
38], subjects had performed a human–computer interaction task, similarly to the work of [
39], where natural conversations between pairs of people were investigated. In [
59], subjects were interviewed and asked to describe the childhood experience, and in [
56], subjects tried to convince the interviewers they were telling the truth. In [
48], subjects had described neutral photographs, played a game of Tetris, described the game of Tetris and solved cognitive tasks. Differently, in [
57], a driver was recorded during the drive, and the work of [
52] presented an interaction from TV chat shows and religious programs and discussions between old acquaintances. In [
53], participants were playing a game in which one person has to explain to the other using gestures and body movement a ‘taboo’ concept or word.
- Within the framework of in-the-wild datasets, the collected data come from movies [
10,
13], films, TV plays, interviews and talk shows [
77,
81,
83], videos downloaded from Youtube [
71], images and videos from the Internet [
17,
73,
84] as well as from Flickr [
70,
78].
Most of the datasets have classified emotions into the six basic categories (angry, disgust, fear, happy, sad, surprise) [
7,
64,
65,
66], with some datasets adding the neutral one [
9,
10,
11]. There are also datasets that further extended the basic six plus neutral expression model with one additional expression, like pain [
12], or contempt [
13]. Other datasets added more expressions, like happiness or amusement/sadness/surprise or startle/embarrassment/fear or nervous/pain/anger or upset/disgust [
14]. Actually, a variety of expressions beyond those indicated above can be found in the existing datasets. For example, there are twenty-three categories of emotion in [
17] according to [
85]; nine categories (no-face, six basic expressions, neutral, none, and uncertain) in [
15]; thirteen emotional and mental states are included in [
16], where the six basic emotions plus boredom and contempt are complemented with some mental states (i.e., confusion, neutral, thinking, concentrating, and bothered); four emotions (sadness, surprise, happiness, and disgust) are given in [
31]; with only one emotion (smile) being included in [
60,
79]. The Valence and Arousal expression model was instead followed in [
35,
41,
73,
75]. We note that some datasets also include Action Unit (AU) annotations. For instance, the EmotioNet [
17] and DISFA [
36] datasets provide annotations for 12 AUs, and in the CASME [
86] dataset, AUs are coded by two coders based on Ekman’s study.
Table 2 groups the datasets according to the different ways emotions are categorized.
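When datasets with such heterogeneous label sets are combined (e.g., for cross-dataset training or evaluation), the labels are typically mapped onto a common vocabulary. The following is a minimal sketch of such a harmonization step; the mapping dictionary and label names are illustrative assumptions and would have to be adapted to the actual annotation files of each dataset.

```python
# Illustrative harmonization of dataset-specific labels onto the six basic
# emotions plus neutral; every entry below is an assumed example, not taken
# from any dataset's official annotation files.
COMMON_LABELS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"}

LABEL_MAP = {
    "angry": "anger",
    "happy": "happiness",
    "happiness or amusement": "happiness",
    "sad": "sadness",
    "surprise or startle": "surprise",
    "fear or nervous": "fear",
    "contempt": None,   # no counterpart in the common set: drop the sample
}

def harmonize(label):
    """Map a dataset-specific label to the common vocabulary (or None to discard)."""
    label = label.strip().lower()
    if label in COMMON_LABELS:
        return label
    return LABEL_MAP.get(label)  # None for unknown or unmapped labels

print(harmonize("Happy"))                # 'happiness'
print(harmonize("surprise or startle"))  # 'surprise'
print(harmonize("contempt"))             # None
```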
It is worth mentioning that some datasets contain 3D scans of expressive faces. For example, 4DFAB [
61] contains 3D faces (over 1,800,000 3D meshes), and PICS-Stirling ESRC 3D Face Database [
63] presents 3D face scans along with 2D images and video sequences. Likewise, CAM3D [
45] is a 3D multimodal corpus dataset, and B3D(Ac) [
46] dataset presents facial expressions in dynamic 3-D face geometries. Likewise, BP4D+ [
25] contains high-resolution 3D dynamic imaging with a variety of sensors of the face, 4D CCDb [
32] is a 4D (3D Video) audio-visual dataset, BP4D-Spontaneous [
14] is a 3D video dataset of spontaneous facial expressions, and Hi4D-ADSIP [
12] presents a comprehensive 3D dynamic facial articulation dataset.
In what follows, we propose some other categorizations for macro-expression datasets:
Number of subjects:
Table 3 presents a classification of macro-expression datasets according to the number of subjects. Most of the datasets contain less than 50 subjects, with just a few datasets containing more than 500 subjects. The number of subjects can reach several thousands if the expressions are spontaneous or in-the-wild.
Age variation: There are many age ranges in macro-expression datasets. Most of the datasets include subjects in a relatively small range (from 18 to 30 years), namely TAVER, RAVDESS, GFT, MAHNOB Mimicry, BP4D-Spontaneous, MAHNOB Laughter, DEAP, USTC-NVIE, MMI-V, AvID, AVIC, ENTERFACE, UT-Dallas, RU-FACS, UA-UIUC, AAI, iSAFE, and ISED. Some other datasets have a moderate range (18–60), including EB+, SEWA, BP4D+ (MMSE), BAUM-1, BioVid Emo, 4D CCDb, AVEC’14, DISFA, AVEC’13 AViD-Corpus, CCDb, DynEmo, SEMAINE, MAHNOB-HCI, Hi4D-ADSIP, CAM3D, B3D(AC), CK+, VAM-faces, and MM. Few datasets contain children, including CHEAVD, 4DFAB, BAUM-2, AFEW-VA, AFEW, and Aff-Wild2. However, child facial expressions were mixed with adult expression samples without differentiating them based on age or age group. On the other hand, in the CHEAVD dataset, the participants were divided into six age groups, and in the 4DFAB dataset, the age distribution includes five categories, with the youngest subjects falling in the 5–18 category. However, these datasets did not take into consideration the differences in facial expressions according to age.
Frame per second (FPS): In macro-expression analysis, the number of FPS is relevant depending on the application context. In the following datasets, the number of FPS is smaller than or equal to 20: TAVER, AM-FED+, and AM-FED. Instead, the number of FPS is greater than 50 for the 4DFAB, 4D CCDb, MAHNOB-HCI, Hi4D-ADSIP, FreeTalk, iSAFE, and ISED datasets. The largest number of FPS, equal to 120, is reached in the IEMOCAP dataset, which makes it a relevant source for studying macro-expressions.
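Since the surveyed datasets were recorded at very different frame rates, a common preprocessing step when comparing them is to resample the videos to a shared FPS. The sketch below shows a simple frame-index subsampling strategy; the function name and the target rate are illustrative assumptions, not part of any dataset's official tooling.

```python
def subsample_indices(n_frames, src_fps, target_fps):
    """Return the frame indices that approximate a lower target frame rate.

    E.g., a clip recorded at 120 FPS (IEMOCAP-like) can be reduced to 20 FPS
    by keeping roughly one frame out of six.
    """
    if target_fps >= src_fps:
        return list(range(n_frames))  # nothing to drop
    step = src_fps / target_fps       # spacing between kept frames
    indices, t = [], 0.0
    while round(t) < n_frames:
        indices.append(round(t))
        t += step
    return indices

# Example: 1 s of video at 120 FPS reduced to 20 FPS keeps 20 frames.
print(len(subsample_indices(120, 120, 20)))  # 20
print(subsample_indices(12, 120, 20))        # [0, 6]
```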
Ethnicity: The existing macro-expression datasets contain various ethnicities such as Latino (EB+, 4DFAB, Aff-Wild2, BP4D+, RU-FACS), Hispanic (EB+, 4DFAB, Aff-Wild2, BP4D+, BP4D-Spontaneous, DISFA), White (EB+, BP4D+), African (EB+, Aff-Wild2, BP4D+, BP4D-Spontaneous, DISFA), Asian (EB+, 4DFAB, Aff-Wild2, BP4D+, BP4D-Spontaneous, DISFA, CAM3D, MMI-V, AVIC, MMI, RU-FACS, iSAFE), and Caucasian (4DFAB, Aff-Wild2, RAVDESS, DynEmo, CAM3D, UT-Dallas). However, some datasets contain participants from around the world or randomly selected (RAF-DB, AM-FED+, GFT, AffectNet, AFEW-VA, EmotioNet, AM-FED, AFEW, FreeTalk).
Amount of data: Here, the main distinction is between datasets that include images, like EB+, TAVER, Aff-Wild2, AM-FED+, AFEW-VA, SEWA, Aff-Wild, BAUM-1, BioVid Emo, Vinereactor, CHEAVD, 4D CCDb, OPEN-EmoRec-II, AVEC’14, RECOLA, AM-FED, AVEC’13, CCDb, DynEmo, DEAP, AFEW, Belfast induced, MAHNOB-HCI, UNBC-McMaster, CAM3D, B3D(AC), UT-Dallas, EmoTV, UA-UIUC, and AAI; and datasets that instead comprise videos, like RAF-DB, AffectNet, EmotioNet, FER-Wild, HAPPEI, FER-2013, SFEW, USTC-NVIE, iSAFE, and ISED.