Peer-Review Record

3D Approaches and Challenges in Facial Expression Recognition Algorithms—A Literature Review

Appl. Sci. 2019, 9(18), 3904; https://doi.org/10.3390/app9183904
by Francesca Nonis *, Nicole Dagnes, Federica Marcolin and Enrico Vezzetti
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 8 August 2019 / Revised: 6 September 2019 / Accepted: 10 September 2019 / Published: 18 September 2019
(This article belongs to the Special Issue Human-Computer Interaction and 3D Face Analysis)

Round 1

Reviewer 1 Report

(1) The authors said FER could be Facial Expression Recognition or Facial Emotion Recognition. It needs to be made clearer when FER stands for Facial Expression Recognition and when FER stands for Facial Emotion Recognition.

(2) The symbols in Table 7 are not ideal descriptions; for example, it is unclear whether the symbol “V” means High or Low.

 

Author Response

Reviewer 1

We thank the reviewer for the useful comments. We have revised the paper following the suggestions, and we hope that it could now be considered for publication.

The authors said FER could be Facial Expression Recognition or Facial Emotion Recognition. It needs to be made clearer when FER stands for Facial Expression Recognition and when FER stands for Facial Emotion Recognition.

The reviewer is right. We have inserted a clearer definition of Facial Emotion Recognition in Section 1, "Introduction to FER." Furthermore, we have explained the meaning of the FER acronym for this work.

"The last step analyzes the movement of facial features and classifies them into emotion or attitude categories, also taking the name of Facial Emotion Recognition, a topic of emotion recognition that involves the analysis of human facial expressions in multimodal forms. More generally, emotion recognition is the automatic processing of human emotions, most typically from facial expressions as well as from verbal expressions, but also body movement and gestures. The acronym FER, in literature, often refers to both facial expression recognition and facial emotion recognition [1]. In this paper, it stands for Facial Expression Recognition, the recognition of emotional states based on facial expressions."

 

The symbols in Table 7 are not ideal descriptions; for example, it is unclear whether the symbol “V” means High or Low.

The suggestion is valuable. Table 7 has been revised to facilitate reading and understanding.

Table 7. Pros and cons of 2D and 3D methods

 

| Aspect | 2D | 3D |
| --- | --- | --- |
| Illumination changes, head motions, aging, and facial make-up | 2D images and videos suffer from these variations, which can affect performance | 3D data are naturally robust to these variations, being immune to illumination changes and, to some extent, to pose variations |
| Data acquisition | Trivial acquisition, possible with any device | Technology makes 3D acquisition easier and easier |
| Amount of data available | Large amount of data and public datasets | Only a few datasets are available, but their number is expected to increase |
| Dimensional and computational costs | Very low costs | Greater dimensionality and, consequently, higher storage and computational costs |
| Facial surface measurements | Not enabled; a difficulty inherent in the 2D modality | 3D enables true facial surface measurements |
| Hidden acquisition cameras | Available | Not available |
| Performance for low-intensity AUs | Poor performance, with recognition rates lower than 3D | Good performance for lower-face AUs and low-intensity AUs |
| Acquisition and recognition | Frontal-view recognition | Ear-to-ear frontal face acquisition |
| Neutral scanning | No need for neutral scanning | Often requires a neutral scan, a disadvantage for real-time applications |
| Availability of databases with AUs and dynamic facial samples | High availability of databases, including public ones | Still low availability due to technical problems; increasing in recent years thanks to the ease of 3D video capture |
| Real-time | Easy for a small amount of data | Not always good due to large time consumption |

 

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper conducts a study to understand facial expression analysis and recognition by analyzing the limits and strengths of traditional and deep-learning facial expression recognition techniques, intending to provide the research community with an overview of the results obtained and a look to the near future. The authors also describe the most used databases for addressing the problem of facial expressions and emotions. The paper addresses an interesting facial expression recognition issue. The authors describe in detail their initial thoughts and the reasons why this study is important. In addition, the paper includes results that might be useful for those interested in facial expression recognition. At this point, I would like to make some recommendations to the authors in order to improve their paper and make it ready for journal publication.

1. Section 1 is too long. I would recommend shortening this section and moving content to Section 2, which would be the related work. (The Introduction section is about introducing the concept of the paper. Please revise.)

2. Materials and methods should become Section 3; however, I would suggest putting the tables in a supplementary document instead of in the paper. Because of the tables, the reader loses the reading flow.

3. Section 3 also covers a lot of prior work. This is a methods section, not a related-work section. Please reorganize the paper.

4. A discussion section might also be useful for the reader. Which method is better? On which database? Why? When should someone consider method A and database C instead of method F and dataset X? These are some questions that should drive your writing of the discussion section.

5. What is the future work with respect to this paper? The authors address some next steps but do not propose any future use of the findings of this paper. (You should discuss your future work in your conclusion section.)

6. It might make sense to provide links to the datasets and methods (implementations) that are available. Otherwise, readers have to search to find them.

7. A number of critical references in technical facial animation are omitted. Please consider including and discussing the references suggested below.

--Face/off: Live facial puppetry

--Example-based facial rigging

--Realtime performance-based facial animation

--Realtime facial animation with on-the-fly correctives

--Kernel projection of latent structures regression for facial animation retargeting

--Real-Time Hierarchical Facial Performance Capture

--Real-time Facial Expression Transformation for Monocular RGB Video

--Structure-aware transfer of facial blendshapes

--A Practical Model for Live Speech Driven Lip-Sync

 

I believe that after addressing the above issues, the paper will be ready for publication.

Author Response

Reviewer 2

This paper conducts a study to understand facial expression analysis and recognition by analyzing the limits and strengths of traditional and deep-learning facial expression recognition techniques, intending to provide the research community with an overview of the results obtained and a look to the near future. The authors also describe the most used databases for addressing the problem of facial expressions and emotions. The paper addresses an interesting facial expression recognition issue. The authors describe in detail their initial thoughts and the reasons why this study is important. In addition, the paper includes results that might be useful for those interested in facial expression recognition. At this point, I would like to make some recommendations to the authors in order to improve their paper and make it ready for journal publication.

We thank the reviewer for appreciating our work and for the useful comments, which were of great help in revising the manuscript. We have revised the paper following the suggestions, and we hope that, with the new integrations, it could now be considered for publication.

Section 1 is too long. I would recommend shortening this section and moving content to Section 2, which would be the related work. (The Introduction section is about introducing the concept of the paper. Please revise.)

We have shortened Section 1 by moving some content to Section 2 "Basic Emotions and Action Units" and Section 7 "Role of time".

Section 2
"Paul Ekman [15], in 1971, defined a set of six emotions that are accepted as universal: anger, disgust, fear, happiness, sadness, and surprise. Paul Ekman and Wallace Friesen named this group as basic emotions, which are universally recognized regardless of language and culture and cannot be decomposed into smaller semantic labels. The seven characteristics of emotions identified by Ekman et al. [16] are: "presence in other primates, distinctive physiology, universal commonalities in antecedent events, quick onset, brief duration, automatic appraisal, and unbidden occurrence." Most research studies on FER have been limited to these six "cardinal" categories of emotions. However, humans make use of a much fuller range of facial expressions for everyday communication than these six, some are even combinations of these basic ones [17]. Martinez et al. state that “there are approximately 7000 different expressions that people frequently use in everyday life. Furthermore, some of the expressions can have multiple interpretations depending on the context in which they are shown” [18,19] (p. 64).
Another possibility for studying facial expressions is the use of action units. An Action Unit is the action of muscles typically seen when an individual produces facial expressions. Defined by Ekman and Friesen in 1978 [20], the Facial Action Coding System (FACS) is given by a set of action units (AUs) to classify the movements of a distinct muscle or a muscle group activation of facial expression".
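
To make the FACS/AU description above concrete, one commonly quoted set of prototype AU combinations for the six basic emotions can be stored as a simple lookup table. The combinations below are illustrative (variants exist in the literature) and are not taken from the reviewed paper.

```python
# Illustrative only: commonly quoted prototype AU combinations for the six basic
# emotions (variants exist in the literature; not taken from the reviewed paper).
PROTOTYPE_AUS = {
    "happiness": {6, 12},                  # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 2, 4, 5, 7, 20, 26},
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15, 16},
}

def closest_basic_emotion(active_aus: set) -> str:
    """Return the basic emotion whose prototype overlaps most with the detected AUs."""
    return max(PROTOTYPE_AUS, key=lambda emotion: len(PROTOTYPE_AUS[emotion] & active_aus))

print(closest_basic_emotion({6, 12, 25}))  # -> "happiness"
```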

Section 7
"Temporal dynamics of facial expression provide additional relevant information that is not available in static 2D or 3D images. Indeed, an emotion lasts from 250 milliseconds to 5 seconds [136]; a dynamic method may be useful for evaluating the intensity level of muscle activities and for classifying emotions."

 

Materials and methods should become Section 3; however, I would suggest putting the tables in a supplementary document instead of in the paper. Because of the tables, the reader loses the reading flow.

The paper has been revised and reorganized, and the titles of some sections have been changed. The structure is now more consistent with the contents presented. However, we have chosen to keep the database section separate from the methods section to focus the reader's attention on the types of approaches. The same structure has also been used in previous literature reviews on the same topics, for example, "Deep Facial Expression Recognition: A Survey" [13].

The reviewer is right. The tables make the reader lose the reading flow and, for this reason, we have moved them to Appendix A. 

 

Section 3 also covers a lot of prior work. This is a methods section, not a related-work section. Please reorganize the paper.

The title "Methods" did not best represent the contents of the section. Therefore, we have chosen to change the title to "Conventional and deep learning-based approaches", adding a brief introduction. In this section are then presented and described both conventional and deep learning-based approaches for the recognition of emotions through facial expressions, referring to the most recent works in literature.

"3. Conventional and deep learning-based approaches

In this section, the main traditional methods for 3D facial expression recognition are presented, comparing feature-based, model-based, and multi-modal algorithms. Next, the leading deep learning techniques applied to FER for feature extraction and classification are described, distinguishing between 2D and 3D."
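
As a minimal, hedged illustration of the feature-based flavour of a conventional approach mentioned in the quoted introduction, the sketch below computes simple geometric features (pairwise distances between 3D facial landmarks) and trains a classical classifier. The landmark data, labels, and hyper-parameters are placeholders, not the methods surveyed in the paper.

```python
# Illustrative sketch of a conventional, feature-based approach: hand-crafted geometric
# features from 3D landmarks fed to a classical classifier. Data and hyper-parameters
# are placeholders, not the methods surveyed in the paper.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def geometric_features(landmarks_3d: np.ndarray) -> np.ndarray:
    """landmarks_3d: (num_landmarks, 3) -> vector of all pairwise Euclidean distances."""
    return np.array([np.linalg.norm(landmarks_3d[i] - landmarks_3d[j])
                     for i, j in combinations(range(len(landmarks_3d)), 2)])

def train_feature_based_classifier(landmark_sets, emotion_labels) -> SVC:
    """landmark_sets: iterable of (num_landmarks, 3) arrays; emotion_labels: class labels."""
    features = np.stack([geometric_features(lm) for lm in landmark_sets])
    return SVC(kernel="rbf", probability=True).fit(features, emotion_labels)
```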

 

A discussion section might also be useful for the reader. Which method is better? On which database? Why? When should someone consider method A and database C instead of method F and dataset X? These are some questions that should drive your writing of the discussion section.

We have added some considerations in the discussion section, in order to guide a future reader in choosing the method to be used.

Next steps
"Considering the different approaches analyzed in the paper, the highest recognition rates were obtained by working on three-dimensional data, or with multi-modal algorithms using 2D and 3D data. These results confirm the advantages of using 3D images or videos compared to the most common 2D methods. Based on the amount of data available, it is possible to decide whether to apply a deep learning technique to FER. Neural networks need a large amount of labeled data and considerable processing power but allow to perform feature extraction and their classification in an end-to-end way, reaching a state-of-the-art level of recognition accuracy. Alternatively, a conventional approach is recommended, and the most widespread is the feature-based algorithm, whereas fewer works focused on model-based or multimodal-based approaches. It is used both for the automatic AU recognition and for the basic emotions, obtaining better performances for the latter. The Action Units are independent of the interpretation, and therefore more suitable to describe spontaneous facial behaviors. For this reason, thanks also to the technological development and the birth of the first databases that consider 3D dynamic and spontaneous facial expressions, in the future works we could expect a more in-depth study of the action units."

Conclusions
"Despite the higher dimensional and computational costs, and the greater difficulty of working in real-time, 3D methods have achieved better recognition rates than the more common 2D methods. For this reason, it is essential to use a dataset that contains 3D facial models or 3D video sequences, such as BU-3DFE, Bosphorus, BU-4DFE, D3DFACS, UPM-3DFE, BP4D, or a private one. Predicting the expression of the human face in real-time requires recognition as accurately and as quickly as possible, but it becomes quite complicated when compared to the static images because a video is a collection of many frames, not just a single frame."

 

What is the future work with respect to this paper? The authors address some next steps but do not propose any future use of the findings of this paper. (You should discuss your future work in your conclusion section.)

We have proposed some future works in Section 9 and in Section 10.

Next steps
"The next step will be to get closer to the real world. Nowadays, with the acquisition and processing tools available, accessible to all at affordable prices, it is easier to work with three-dimensional images and videos. The transition from 2D to 3D, although with some drawbacks above all related to dimensional and computational costs, is almost complete. The next step will see the use of 4D with the introduction of the variable time, with the need to speed up the recognition algorithms and the creation of new databases."

Conclusions
"With this study, we want to provide a guideline for newcomers who will address this topic, and take stock of neural networks, taking advantage of the golden age of AI. The most important works of recent years have been presented, highlighting the pros and cons and the best outcomes in the entire facial expression recognition field.

Soon, the recognition of emotions in real-time will be beneficial in the field of artificial intelligence research, with the need to recognize the emotions of several different people in a single frame and to detect mixed emotions.

In our future work, we plan to include deep learning techniques, working on a private database containing three-dimensional videos and psychological validation of labeled emotions, to perform emotion recognition in the wild."

 

It might make sense to provide links to the datasets and methods (implementations) that are available. Otherwise, readers have to search to find them.

We thank the reviewer for the advice. We have added the links to the datasets, inserting them in Table 4.

 

A number of critical references in technical facial animation are omitted. Please consider including and discussing the below-suggested references.

The suggestion is valuable. We have added Section 4 "Facial Animation", including and discussing all the recommended references.

"4. Facial Animation

Facial animation is an area of computer graphics that consists of methods and techniques for generating and animating models of a human, an animal, or a fantasy character face. Parke made the first efforts to represent and animate three-dimensional faces using computers in 1972 [99]. Computer-based facial expression modeling and character animation is not a new endeavor, but there has been considerable growth of interest in recent years.

Following the success for describing movements of facial muscles of the FACS and Action Units developed by Ekman and Friesen in 1978 [20], Platt [100] and Brennan [101] in the early-1980s produced, respectively, the first physically based muscle-controlled face model, and techniques for facial caricatures. Their studies gave birth to the first animated human character able to express emotion through facial expressions and body movements.

Different techniques exist for the generation of facial animation data: marker-based motion capture [102,103], markerless motion capture, audio-driven animation, and keyframe animation. Marker-based techniques are widely used for real-time facial animation thanks to their robustness, but they are not useful for retrieving fine-scale dynamics and require specialized sensors. To simplify the motion capture process, techniques that do not require markers or specialized tracking hardware emerged, leveraging depth sensors and structured-light-based devices [104]. The researchers demonstrated the ability to track detailed facial expressions from a 3D sensor in real-time, but the system required an extensive set of pre-processed facial expressions and, consequently, a lengthy training session. A year later, Li et al. [105] used the same system, replacing the Principal Component Analysis (PCA) model with an optimized rig in order to reduce the number of training poses and enable retargeting.

Low-cost, real-time 3D sensors such as Microsoft's Kinect favored the development of new methodologies that simplify the procedure [106,107]. In [106], a user-specific dynamic expression model is created in an offline preprocessing step. The novel face tracking algorithm systematically combines the 2D color image and 3D depth map, simultaneously captured by the Kinect, with user-specific blendshapes. The proposed method achieves high-quality performance-driven facial animation in real-time with a low-cost, markerless acquisition system, and produces more robust and accurate results than previous video-based methods. Li et al. [107] proposed a real-time facial animation system with adaptive tracking that requires no training or expression calibration, achieving higher tracking fidelity than existing state-of-the-art techniques.

Other works in facial animation are [108–110]. Mousas and Anagnostopoulos [108] presented a novel mesh deformation method to automatically transfer facial blendshapes from a reference to a source face model. Parameters such as elasticity and mesh curvature descriptors were not considered, but the method achieves a lower error rate than previous methodologies. Wei and Deng [109] studied real-time speech animation driven by live speech input while maintaining the realism of facial animation. Ouzounis et al. [110] presented a methodology that efficiently transfers facial animations to face models with different morphological variations.

Recently, Ma and Deng [111,112] presented a "complete pipeline to photo-realistically transform the facial expression for monocular video in real-time" [111] (p. 9) and a "real-time, automatic, geometry-based method for capturing fine-scale facial performance from monocular RGB video" [112] (p. 8).

Facial animation applications include communication, education, and scientific simulation, even if the primary uses remain animated films and computer games."
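
To make the blendshape-based tracking described in the quoted section concrete, the following sketch shows the textbook linear blendshape model, face(w) = neutral + sum_i w_i * (B_i - neutral), with weights fitted to an observed scan by non-negative least squares. This is a generic formulation given for illustration only, not the specific solvers of the cited works.

```python
# Illustrative sketch of the standard linear blendshape model used by such systems:
# face(w) = neutral + sum_i w_i * (B_i - neutral). Weights are fit to an observed scan
# by non-negative least squares; this is not the exact solver of the cited works.
import numpy as np
from scipy.optimize import nnls

def fit_blendshape_weights(neutral: np.ndarray, blendshapes: np.ndarray, target: np.ndarray) -> np.ndarray:
    """
    neutral:     (num_vertices, 3) neutral face mesh
    blendshapes: (num_shapes, num_vertices, 3) expression meshes
    target:      (num_vertices, 3) observed scan (e.g., from a depth sensor)
    """
    deltas = (blendshapes - neutral).reshape(len(blendshapes), -1).T   # (3V, num_shapes)
    residual = (target - neutral).reshape(-1)                          # (3V,)
    weights, _ = nnls(deltas, residual)                                # non-negative weights
    return weights

def apply_blendshapes(neutral, blendshapes, weights):
    """Reconstruct the animated face from the fitted weights."""
    return neutral + np.tensordot(weights, blendshapes - neutral, axes=1)
```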

 

Platt, S.M.; Badler, N.I. Animating Facial Expressions. In Proceedings of the 8th Annual Conference on Computer Graphics and Interactive Techniques; ACM: New York, NY, USA, 1981; pp. 245–252.
Brennan, S.E. Caricature Generator. Thesis, Massachusetts Institute of Technology, 1982.
Dagnes, N.; Ben-Mansour, K.; Marcolin, F.; Marin, F.; Sarhan, F.R.; Dakpé, S.; Vezzetti, E. What is the best set of markers for facial movements recognition? Ann. Phys. Rehabil. Med. 2018, 61, e455–e456.
Dagnes, N.; Marcolin, F.; Vezzetti, E.; Sarhan, F.-R.; Dakpé, S.; Marin, F.; Nonis, F.; Ben Mansour, K. Optimal marker set assessment for motion capture of 3D mimic facial movements. J. Biomech. 2019, 93, 86–93.
Weise, T.; Li, H.; Van Gool, L.; Pauly, M. Face/Off: Live facial puppetry. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '09); ACM Press: New Orleans, LA, USA, 2009; p. 7.
Li, H.; Weise, T.; Pauly, M. Example-based Facial Rigging. In ACM SIGGRAPH 2010 Papers; ACM: New York, NY, USA, 2010; pp. 32:1–32:6.
Weise, T.; Bouaziz, S.; Li, H.; Pauly, M. Realtime Performance-based Facial Animation. In ACM SIGGRAPH 2011 Papers; ACM: New York, NY, USA, 2011; pp. 77:1–77:10.
Li, H.; Yu, J.; Ye, Y.; Bregler, C. Realtime facial animation with on-the-fly correctives. ACM Trans. Graph. 2013, 32, 42.
Mousas, C.; Anagnostopoulos, C.-N. Structure-aware transfer of facial blendshapes. In Proceedings of the 31st Spring Conference on Computer Graphics (SCCG '15); ACM Press: Smolenice, Slovakia, 2015; pp. 55–62.
Wei, L.; Deng, Z. A Practical Model for Live Speech-Driven Lip-Sync. IEEE Comput. Graph. Appl. 2015, 35, 70–78.
Ouzounis, C.; Kilias, A.; Mousas, C. Kernel Projection of Latent Structures Regression for Facial Animation Retargeting. Workshop on Virtual Reality Interaction and Physical Simulation, 2017; 7 pages.
Ma, L.; Deng, Z. Real-Time Facial Expression Transformation for Monocular RGB Video. Comput. Graph. Forum 2019, 38, 470–481.
Ma, L.; Deng, Z. Real-time hierarchical facial performance capture. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D '19); ACM Press: Montreal, QC, Canada, 2019; pp. 1–10.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

After carefully reading the revised version of the paper as well as the responses made by the authors, I feel confident that this is a strong and scientifically sound paper on facial expression recognition. For this reason, I would like to recommend this paper for the Applied Sciences journal. Well done!
