Peer-Review Record

Guided Spatial Transformers for Facial Expression Recognition

Appl. Sci. 2021, 11(16), 7217; https://doi.org/10.3390/app11167217
by Cristina Luna-Jiménez *, Jorge Cristóbal-Martín, Ricardo Kleinlein, Manuel Gil-Martín, José M. Moya and Fernando Fernández-Martínez
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 21 June 2021 / Revised: 29 July 2021 / Accepted: 2 August 2021 / Published: 5 August 2021
(This article belongs to the Special Issue Computational Trust and Reputation Models)

Round 1

Reviewer 1 Report

This paper proposes adding a "mask generator" to the Spatial Transformer to raise the recognition rate.

These masks are either crafted directly from the estimated facial landmarks or derived from visual saliency maps of the original image, and they are practical for enhancing attention on relevant local regions.

I think it would be a better paper if the authors described in more detail the reason for the different performance on the two datasets (AffectNet and FER-2013).

(1) Compared to the existing ST model, I think it would be a better paper if you described in more detail why the proposed method was able to show superior performance.

(2) Could you please explain why there is a performance difference between the AffectNet and FER-2013 datasets?

Author Response

RESPONSE TO REVIEWERS:

The authors greatly appreciate the time the reviewers dedicated to understanding this paper and making suggestions to improve its next version. Their comments have helped to increase the quality of this work and to explain some ideas more clearly.

The answer to each question is written in blue below the corresponding reviewer comment, and we have also quoted most of the paragraphs added to the article to address the reviewers' concerns.

 

Comments and Suggestions for Authors

This paper proposes adding a "mask generator" to the Spatial Transformer to raise the recognition rate.

These masks are either crafted directly from the estimated facial landmarks or derived from visual saliency maps of the original image, and they are practical for enhancing attention on relevant local regions.

I think it would be a better paper if the authors described in more detail the reason for the different performance on the two datasets (AffectNet and FER-2013).

(1) REVIEWER1: Compared to the existing ST model, I think it would be a better paper if you described in more detail why the proposed method was able to show superior performance.

(1) Thank you very much for your comment. We have added some extra lines in the conclusion section to better explain the advantages of our proposal compared to current STNs:

‘The reason for this accuracy improvement compared to existing STNs lies in introducing more specific information to the attention mechanism, extracted from powerful pre-trained models that generate the saliency maps or the landmarks. The localization network receives these filtered images, in which the most relevant regions are emphasized.

These images with emphasized regions help the attention mechanism to concentrate only on those relevant regions when transforming the original image. As a result of this transformation, the classification branch sees a filtered image without material that is unnecessary for solving the Facial Emotion Recognition task. For this reason, in the results, we have observed an improvement in most of our strategies compared to traditional STNs.’
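To make the data flow described above concrete, here is a minimal PyTorch sketch of a mask-guided STN; the module names (GuidedSTN, loc_net, classifier) and layer sizes are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch (not the authors' exact architecture) of a guided STN:
# the localization network sees the mask, while the sampler transforms
# the original image before classification. Layer sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSTN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Localization network: receives the mask (landmarks or saliency map)
        # and regresses the 2x3 affine transformation parameters.
        self.loc_net = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize the last layer to the identity transform.
        self.loc_net[-1].weight.data.zero_()
        self.loc_net[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Classification branch: sees only the transformed (filtered) image.
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(num_classes),
        )

    def forward(self, image, mask):
        theta = self.loc_net(mask).view(-1, 2, 3)          # attention from the mask
        grid = F.affine_grid(theta, image.size(), align_corners=False)
        attended = F.grid_sample(image, grid, align_corners=False)
        return self.classifier(attended)
```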

--------------------------------

(2) REVIEWER1: Could you please explain why there is a performance difference between the AffectNet and FER-2013 datasets?

(2) Thank you for this comment. We have added an extra paragraph in the conclusions to explain this fact. 

‘Notice that there is a performance difference between AffectNet and FER-2013. The main reason for this difference may be the use of transfer learning (TL) for the FER-2013 dataset, which explains why the models trained on FER-2013 achieved higher accuracy rates. Another important difference between the datasets is the margin between our best model and the baseline STNs: for AffectNet this margin is 0.35%, against the 1.49% reached on FER-2013. These results could be explained by the number of samples in each dataset and the resolution of their images. AffectNet contains more images with good quality and resolution, whereas FER-2013 has fewer images, taken in more challenging conditions. The results seem to indicate that our proposals are more competitive in complex scenarios where the number of samples is limited. This is coherent, since in our strategies the STN only has to learn from the filtered masks (landmarks or saliency maps) and not from the complete image, which is more complex and requires more samples for the localization network to discover and learn patterns.’

Author Response File: Author Response.pdf

Reviewer 2 Report

The article introduces different extensions to improve the performance of conventional Spatial Transformers when applied to Facial Expression Recognition. In general, it is well-written in English and includes an extensive ablation study in order to assess both the mask generator module and the spatial transformer.

Although the methodology followed is clearly described, the paper has, in my opinion, several weaknesses and still needs some explanations, which are enumerated below:

1) The motivation and potential applications of Facial Expression Recognition are barely mentioned at the beginning of the article, in order to contextualize its main contributions. Please, complete the introduction section with this discussion and add references to support it.

2) As the authors mention at the end of section 2.1, "the main difference between this study [15] and ours is that they require the ground truth of the interest regions while we do not need previous annotations or specialized networks more than the ST". However, in order to compute landmarks-based masks, they make use of pre-trained models for face detection (MTCNN [28]) and face landmark detection (taken from the dlib library [29]), which require manually labeled faces and facial landmarks (the same happens when visual saliency-based maps are computed by using a CNN pre-trained on the SALICON database). Could you better explain the differences between the system in [15] and your proposal?

3) I suggest including a figure to illustrate the mechanism of the Spatial Transformer Network used and described in section 3.2. The figure could show, given an example image, both the mapping established by the grid generator and the interpolation carried out by the sampler module, for the sake of ease of comprehension.

4) Given the effectiveness of visual saliency-based masks w.r.t. landmarks-based masks, which is demonstrated in the FER-2013 database, and also the slight improvement achieved by landmarks-soft masks w.r.t. binary masks on the AffectNet database, did you try considering, for both databases, a "soft" version of the saliency maps, normalizing them to be in the range from 0.5 (background) to 1 (most salient regions)?

5) What would be the accuracy of a system which, originally set for using a ST with landmarks, automatically replaces it with a ST with saliency masks when a face is not detected? Would it be comparable or superior to the performance achieved by using only the ST with saliency masks?

6) Including a comparison with some related approaches in the state-of-the-art for Facial Expression Recognition (see references at the end of section 2.2) could serve to further demonstrate and put into context the performance of the best configuration extracted from the experiments described in section 4, and also to notably enhance the relevance of the article.

Author Response

RESPONSE TO REVIEWERS:

The authors greatly appreciate the time the reviewers dedicated to understanding this paper and making suggestions to improve its next version. Their comments have helped to increase the quality of this work and to explain some ideas more clearly.

The answer to each question is written in blue below the corresponding reviewer comment, and we have also quoted most of the paragraphs added to the article to address the reviewers' concerns.

 

Comments and Suggestions for Authors:

The article introduces different extensions to improve the performance of conventional Spatial Transformers when applied to Facial Expression Recognition. In general, it is well-written in English and includes an extensive ablation study in order to assess both the mask generator module and the spatial transformer.

Although the methodology followed is clearly described, the paper has, in my opinion, several weaknesses and still needs some explanations, which are enumerated below:

1) REVIEWER2: The motivation and potential applications of Facial Expression Recognition are barely mentioned at the beginning of the article, in order to contextualize its main contributions. Please, complete the introduction section with this discussion and add references to support it.

1) Thank you for your comment. We have completed the introduction with the motivation and some applications of FER to contextualize the proposal: 

‘To evaluate the viability of this idea, we test our proposal on a Facial Emotion Recognition task, given its interest in different fields. Recognizing emotions lets us interact efficiently with others. By analyzing user reactions, it is also possible to detect a loss of trust or changes in emotion in ECAs (Embodied Conversational Agents), letting the system react to this event and adapt its behavior to improve interactions or modify the dialogue content, tone, or facial expression (if it has one) to create a better socio-affective user experience [7].

Also, systems able to recognize certain emotions, or deficits in them, could help to diagnose diseases like depressive disorders [8] or Parkinson's disease [9], and improve the treatment of patients.

Another relevant application of facial expression recognition is automotive safety. Recognizing negative states like stress, anger, or fatigue in intelligent vehicles is crucial to avoid traffic accidents and increase road safety [10], allowing the vehicle to act according to the state of the driver.

The selected datasets for our work are AffectNet [11] and FER-2013 [12]. In both datasets, the strategies that employ masks instead of the original images significantly improve the results reached by the conventional STN model.’

 

ADDED REFERENCES:

[7] de Visser, E.J.; Pak, R.; Shaw, T.H. From ‘automation’ to ‘autonomy’: the importance of trust repair in human–machine interaction. Ergonomics 2018, 61, 1409–1427. doi:10.1080/00140139.2018.1457725.

[8] Nyquist, A.C.; Luebbe, A.M. An Emotion Recognition–Awareness Vulnerability Hypothesis for Depression in Adolescence: A Systematic Review. Clinical Child and Family Psychology Review 2019, 23, 27–53. doi:10.1007/s10567-019-00302-3.

[9] Argaud, S.; Vérin, M.; Sauleau, P.; Grandjean, D. Facial emotion recognition in Parkinson's disease: A review and new hypotheses. Movement Disorders 2018, 33, 554–567. doi:10.1002/mds.27305.

[10] Zepf, S.; Hernandez, J.; Schmitt, A.; Minker, W.; Picard, R.W. Driver Emotion Recognition for Intelligent Vehicles: A Survey. ACM Comput. Surv. 2020, 53. doi:10.1145/3388790.

--------------------------------

 

2) REVIEWER2: As the authors mention at the end of section 2.1, "the main difference between this study [15] and ours is that they require the ground truth of the interest regions while we do not need previous annotations or specialized networks more than the ST". However, in order to compute landmarks-based masks, they make use of pre-trained models for face detection (MTCNN [28]) and face landmark detection (taken from the dlib library [29]), which require manually labeled faces and facial landmarks (the same happens when visual saliency-based maps are computed by using a CNN pre-trained on the SALICON database). Could you better explain the differences between the system in [15] and your proposal?

2) Thank you for your question. We have extended the paragraph in which we discuss this comparison for clarity.

In summary, there are several differences. One is the domain of application: they require a specific model to solve their task, image registration, and this conditions their whole pipeline, which is why they need to annotate the data; in our case, we use pre-trained models that automatically generate the areas of interest. As a consequence, our reported results contain the errors introduced by these systems, as we comment in the analysis in the appendix, and probably reach lower accuracy than a tailored system trained with manually annotated data of the most relevant regions, but our methodology is more generic and could be adapted to other domains, since everything is extracted automatically without new human annotation steps. Another difference is the ablation study that we perform in this work: we compare several strategies based on facial landmarks and on saliency maps, whereas they only compare landmarks adapted to their domain and annotated segments of the most important regions. The advantage of the saliency maps is that they are more general, and for many tasks this model could be general enough to be used with good results without specifically fine-tuning the network to adapt it to the task.

Another difference is the nature of the landmarks: although they are referred to by the same name, the morphology of our extractor and theirs is completely different, since their model processes medical images while our extractor works on facial images, which requires the detection and existence of a face in the image.

We have added the following paragraphs in Section 2.1 to clarify these ideas. Now the reference [15] has changed to [19]:

‘In this third group is also the work of M. C. H. Lee et al. [19]. In their article, the authors argue that their 'Image-and-Spatial-Transformer Networks' (ISTN) could improve the medical image registration problem, which consists of the alignment of several images. They propose adding an extra network on top of an STN. The top network (ITN) generates the segments that make up the input image. Then, the STN predicts the transformation matrix to align the images from the segments generated by the ITN. However, training the ITN requires ground-truth images with the landmarks of the segments correctly annotated.

Our proposal follows a similar idea in a more general way, because we do not have access to the ground truth of the attention regions. Instead, we evaluate images generated automatically by different general-purpose pre-trained networks that emphasize the most relevant areas of an input image. With these generic masks, we perform an ablation study to assess the impact of passing each of them to the localization network of the STN. Results reveal that their inclusion enhances the performance of conventional STNs for the emotion recognition task.

One of the advantages of our proposal is that we do not need to re-train the models that extract the masks, since they are general-purpose models that can be applied to several tasks. The use of these general-purpose models reduces the manual annotation effort required to create supervised, tailored models.

Apart from the similarities with [19], there are also several differences. The first one is that we apply our proposal to a different domain, and thus we require other landmark extractors that detect morphologically different regions. Another of the main differences is the architecture of the STN that we use, since we need to add some extra layers after predicting the transformation-matrix parameters to address the facial emotion recognition task. Finally, we extend their ablation study by evaluating modified versions of the landmarks and the idea of using saliency maps, everything extracted automatically without re-training these models.’
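As a rough illustration of the landmarks-based mask generation discussed above, the following Python sketch uses dlib's face detector and 68-point shape predictor (the paper relies on MTCNN for detection and dlib for landmarks); the model path, disc radius, and function name are illustrative assumptions, while the full-white fallback mirrors the v1 behavior described later in the response to comment 5.

```python
# Hedged sketch of a landmarks-based mask generator. Not the authors' code:
# dlib's detector stands in for MTCNN, and the radius/path are assumptions.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_binary_mask(gray_image, radius=3):
    """Return a binary mask highlighting facial landmarks, or a full-white
    image when no face is detected (the v1 fallback described in the paper)."""
    faces = detector(gray_image, 1)
    if len(faces) == 0:
        return np.full(gray_image.shape, 255, dtype=np.uint8)
    mask = np.zeros(gray_image.shape, dtype=np.uint8)
    shape = predictor(gray_image, faces[0])
    for i in range(68):
        p = shape.part(i)
        cv2.circle(mask, (p.x, p.y), radius, 255, thickness=-1)
    return mask
```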

--------------------------------

3) REVIEWER2: I suggest including a figure to illustrate the mechanism of the Spatial Transformer Network used and described in section 3.2. The figure could show, given an example image, both the mapping established by the grid generator and the interpolation carried out by the sampler module, for the sake of ease of comprehension.

3) Thank you for your suggestion. We have added a figure (Figure 2) to illustrate the outputs of the grid generator and the sampler. We have also extended and modified the explanation of the STNs in Section 3.2 to make it more comprehensive.
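For readers without access to the new Figure 2, the following minimal PyTorch snippet reproduces the two operations it illustrates, the grid generator and the bilinear sampler; the 2x zoom-in transform is just an arbitrary example, not a value from the paper.

```python
# Illustrative sketch of the two STN stages: the grid generator maps output
# pixel coordinates through the affine matrix theta, and the sampler
# bilinearly interpolates the input image at those coordinates.
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 48, 48)            # dummy grayscale face, NCHW

# theta = [[s, 0, tx], [0, s, ty]]: a scale of 0.5 in normalized coordinates
# samples a centered crop covering half of each axis, i.e. a 2x zoom.
theta = torch.tensor([[[0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]]])

grid = F.affine_grid(theta, image.size(), align_corners=False)
print(grid.shape)                            # torch.Size([1, 48, 48, 2])

zoomed = F.grid_sample(image, grid, mode="bilinear", align_corners=False)
print(zoomed.shape)                          # torch.Size([1, 1, 48, 48])
```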

--------------------------------

4) REVIEWER2: Given the effectiveness of visual saliency-based masks w.r.t. landmarks-based masks, which is demonstrated in the FER-2013 database, and also the slight improvement achieved by landmarks-soft masks w.r.t. binary masks on AffectNet database, did you try considering, for both databases, a "soft" version of the saliency maps, normalizing them to be in the range from 0.5 (background) to 1 (most salient regions)?  

4) Thank you very much for this idea. Actually, we came up with the same one; however, we finally did not consider a soft version of the saliency maps, because most visual saliency models differ from landmark extraction models in that they do not predict relevant regions by suggesting a few fixation points in the image. Instead, visual saliency models highlight the entire salient regions rather uniformly, whether in the image background or foreground (current state-of-the-art saliency models combine or fuse both foreground and background cues extracted from the image).

Here, we worked under the assumption that saliency would be able to accurately and smoothly highlight not only evident faces in the foreground but also other relevant regions from the background (if any). Hence, coherently, no soft version of the saliency maps was actually needed.
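For reference only, the "soft" rescaling suggested by the reviewer, which the authors did not evaluate, would amount to a mapping like the one below; the function name and implementation are our own illustration, not part of the paper.

```python
# Illustration of the reviewer's suggestion: rescale a saliency map so the
# background sits at 0.5 and the most salient regions at 1.0.
import numpy as np

def soft_saliency(saliency_map):
    """Map a saliency map to [0.5, 1]: 0.5 = background, 1 = most salient."""
    s = saliency_map.astype(np.float32)
    s = (s - s.min()) / max(float(s.max() - s.min()), 1e-8)  # normalize to [0, 1]
    return 0.5 + 0.5 * s
```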

--------------------------------

5) REVIEWER2: What would be the accuracy of a system which, originally set for using a ST with landmarks, automatically replaces it with a ST with saliency masks when a face is not detected? Would it be comparable or superior to the performance achieved by using only the ST with saliency masks?

5) Thank you for your suggestion. We had some of these ideas in mind as future work, and we have had time to include some of them related to your questions. More specifically, we have carried out two extra experiments:

The first one consists of training the STN with landmarks (the binary mask strategy) but introducing saliency maps when the face is not detected, instead of pure white masks covering the whole image. We have called this strategy ‘STN with landmarks - binary masks v2’.

The other experiment consists of using the predictions of the STN trained on saliency maps when the face is not detected, ignoring the prediction of the STN trained with landmarks. We have called this strategy ‘STN with landmarks - binary masks v3’.

In Tables 1 and 2, you can see the results of these experiments. These two versions obtain results comparable to v1, in which we use white images, with v3 being the best, although it does not outperform the saliency version. This small improvement could also be explained by the low rate of lost images in AffectNet, which was 1.6%. For FER-2013 we could have expected a larger increase, since we lost 13.57% of the images, but the results are similar, probably because those images did not contain faces or had many occlusions, as we comment in the Appendixes. Applying these ideas to the soft-masks version might increase the result a bit more, but outperforming the saliency version is not guaranteed in that case either.
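A schematic of how the three fallback variants could be wired together is sketched below; the function and argument names are hypothetical and do not come from the authors' code.

```python
# Hedged sketch of the v1/v2/v3 fallback logic for samples where the face
# detector finds no face. stn_landmarks and stn_saliency are assumed to be
# callables mapping (image, mask) -> class prediction.
import numpy as np

def predict_with_fallback(image, face_found, landmarks_mask, saliency_map,
                          stn_landmarks, stn_saliency, variant="v1"):
    if face_found:
        return stn_landmarks(image, landmarks_mask)
    if variant == "v1":                      # full-white mask: all pixels matter
        white = np.full(image.shape, 255, dtype=np.uint8)
        return stn_landmarks(image, white)
    if variant == "v2":                      # feed the saliency map instead
        return stn_landmarks(image, saliency_map)
    if variant == "v3":                      # defer to the saliency-trained STN
        return stn_saliency(image, saliency_map)
    raise ValueError(f"unknown variant: {variant}")
```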

We have added a mention to these experiments in Section 3.1.1 and in the Results section on AffectNet dataset:

Section 3.1.1: 

‘Notice that for some samples the facial detector fails and does not detect any face. In those cases, it is not possible to generate the landmarks mask for that sample. As the STN always expects an image, we tested several ideas for substituting the landmarks masks in these cases: the first one, which appears in the tables as v1, consists of substituting those samples with full-white images to indicate to the model that all pixels are equally relevant; the second option, or v2, was to introduce the saliency maps instead of the white images; and the third option, or v3, was to ignore the predictions of the model trained with landmarks masks and rely on the predictions of another model for those images; in our case, this second model was trained with saliency maps and appears in Tables 1 and 2 as 'STN with saliency masks'.

For simplicity, the rest of the strategies derived from the landmarks use full-white images (v1) when the face is not detected. 

In the following subsections, we will detail how we extract the saliency maps and how we train each model with the different masks.’ 

Results Section: 

‘Regarding the method used to deal with faces missed by the facial detector, we can see that the best option is version 3.

Replacing the white images (v1) with saliency maps (v2) does not improve the final accuracy, probably because for AffectNet the number of lost faces was not too large (1.6%), or because training the STN with both landmarks and saliency maps introduces some noise during the learning process due to the different ways of representing the information. The third version, which relies on the predictions of the STN trained with saliency maps for those cases, improves the accuracy slightly, but it still does not surpass the results obtained by the STN trained with saliency masks. This tendency is also observed in the results for the FER-2013 dataset in Table 2.’

--------------------------------

6) REVIEWER2: Including a comparison with some related approaches in the state-of-the-art for Facial Expression Recognition (see references at the end of section 2.2) could serve to further demonstrate and put into context the performance of the best configuration extracted from the experiments described in section 4, and also to notably enhance the relevance of the article.

6) Thank you for your comment. We did not include such a comparison in the article because we had to modify the labels of the datasets (AffectNet and FER-2013) to adapt their categories: in both cases, we changed the problem to recognizing positive, negative, and neutral valence instead of the original emotions (which differ between the two corpora), as we comment in Section 3. Additionally, the main target of this article was not to evaluate the performance of STs against other, completely different approaches, but rather to assess the contribution of the localization network and to evaluate alternative strategies or modifications to improve its performance (we have added some extra lines to clarify this target). For this reason, we used a simple architecture and trained it with our strategies to check whether it was able to beat the conventional Spatial Transformer. As future work, we would like to increase the size of this network and investigate architectures which, combined with our masks, could reach state-of-the-art performance. As a reference for the power of these models on the Facial Emotion Recognition task, in [5] the authors reached good accuracy rates on several datasets. This result encourages us to think that it is possible to find an architecture that, combined with our ideas, could beat the state of the art in FER. However, this is not the contribution of the current publication.





Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have addressed all my concerns, so I think the article can be accepted in its present form. Congratulations on your research!
