Semi-Supervised Group Emotion Recognition Based on Contrastive Learning
Abstract
1. Introduction
2. Related Work
2.1. Group Emotion Recognition
2.2. Contrastive Learning
2.3. Semi-Supervised Learning
3. Proposed Methods
3.1. The SFNet
3.2. The FusionNet
3.3. Training Process
3.3.1. Stage 1: Pretraining SFNet with Contrastive Learning
3.3.2. Stage 2: Pretraining SFNet and FusionNet with Labeled Data
3.3.3. Stage 3: Giving Unlabeled Data with Pseudo-Labels
3.3.4. Stage 4: Further Training of SSGER
4. Experiments and Discussion
4.1. Implementation Details
4.2. Comparison on Classification Performance
4.3. Ablation Study
4.3.1. Ablation Study on Contrastive Learning
4.3.2. Ablation Study on Pseudo-Labels
4.3.3. Ablation Study on WCE-Loss
4.4. Results of Different Label Rate
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Barsade, S.G.; Gibson, D.E. Group Emotion: A View from Top and Bottom, Research on Managing Groups and Teams; JAI Press Inc.: Stamford, CT, USA, 2008.
2. Dhall, A.; Asthana, A.; Goecke, R. Facial Expression Based Automatic Album Creation. In Proceedings of the International Conference on Neural Information Processing, Sydney, Australia, 6 December 2010; pp. 485–492.
3. Meftah, I.T.; Le Thanh, N.; Amar, C.B. Detecting Depression Using Multimodal Approach of Emotion Recognition. In Proceedings of the 2012 IEEE International Conference on Complex Systems (ICCS), Agadir, Morocco, 5–6 November 2012; pp. 1–6.
4. Basavaraju, S.; Sur, A. Image memorability prediction using depth and motion cues. IEEE Trans. Comput. Soc. Syst. 2020, 7, 600–609.
5. Khosla, A.; Raju, A.S.; Torralba, A.; Oliva, A. Understanding and Predicting Image Memorability at a Large Scale. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2390–2398.
6. Clavel, C.; Vasilescu, I.; Devillers, L.; Richard, G.; Ehrette, T. Fear-Type emotion recognition for future audio-based surveillance systems. Speech Commun. 2008, 50, 487–503.
7. Park, C.; Ryu, J.; Sohn, J.; Cho, H. An Emotion Expression System for the Emotional Robot. In Proceedings of the 2007 IEEE International Symposium on Consumer Electronics, Irving, TX, USA, 20–23 June 2007; pp. 1–6.
8. Xie, Q.; Luong, M.-T.; Hovy, E.; Le, Q.V. Self-Training with Noisy Student Improves Imagenet Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698.
9. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.-L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
10. Gao, J.; Wang, J.; Dai, S.; Li, L.-J.; Nevatia, R. Note-rcnn: Noise Tolerant Ensemble rcnn for Semi-Supervised Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9508–9517.
11. Hoffman, J.; Guadarrama, S.; Tzeng, E.S.; Hu, R.; Donahue, J.; Girshick, R.; Darrell, T.; Saenko, K. LSDA: Large scale detection through adaptation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'14), Montreal, Canada, 8 December 2014; MIT Press: Cambridge, MA, USA, 2014; pp. 3536–3544.
12. Khan, A.S.; Li, Z.; Cai, J.; Tong, Y. Regional Attention Networks With Context-Aware Fusion for Group Emotion Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 1150–1159.
13. Wang, K.; Zeng, X.; Yang, J.; Meng, D.; Zhang, K.; Peng, X.; Qiao, Y. Cascade Attention Networks for Group Emotion Recognition with Face, Body and Image Cues. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; pp. 640–645.
14. Dhall, A.; Goecke, R.; Gedeon, T. Automatic group happiness intensity analysis. IEEE Trans. Affect. Comput. 2015, 6, 13–26.
15. Tan, L.; Zhang, K.; Wang, K.; Zeng, X.; Peng, X.; Qiao, Y. Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNN. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 549–552.
16. Surace, L.; Patacchiola, M.; Battini Sönmez, E.; Spataro, W.; Cangelosi, A. Emotion Recognition in the Wild Using Deep Neural Networks and Bayesian Classifiers. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, New York, NY, USA, 3 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 593–597.
17. Fujii, K.; Sugimura, D.; Hamamoto, T. Hierarchical group-level emotion recognition. IEEE Trans. Multimed. 2020, 23, 3892–3906.
18. Bawa, V.S.; Kumar, V. Emotional sentiment analysis for a group of people based on transfer learning with a multi-modal system. Neural Comput. Appl. 2019, 31, 9061–9072.
19. Li, D.; Luo, R.; Sun, S. Group-Level Emotion Recognition Based on Faces, Scenes, Skeletons Features. In Proceedings of the Eleventh International Conference on Graphics and Image Processing (ICGIP 2019), Hangzhou, China, 12–14 October 2019; pp. 46–51.
20. Li, J.; Roy, S.; Feng, J.; Sim, T. Happiness Level Prediction with Sequential Inputs via Multiple Regressions. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 487–493.
21. Wang, Y.; Zhou, S.; Liu, Y.; Wang, K.; Fang, F.; Qian, H. ConGNN: Context-consistent cross-graph neural network for group emotion recognition in the wild. Inf. Sci. 2022, 610, 707–724.
22. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 13–18 July 2020; pp. 1597–1607.
23. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
24. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15750–15758.
25. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
26. Deng, Y.; Yang, J.; Chen, D.; Wen, F.; Tong, X. Disentangled and Controllable Face Image Generation via 3d Imitative-Contrastive Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5154–5163.
27. Dhall, A.; Joshi, J.; Sikka, K.; Goecke, R.; Sebe, N. The More the Merrier: Analysing the Affect of a Group of People in Images. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; pp. 1–8.
28. Dhall, A.; Kaur, A.; Goecke, R.; Gedeon, T. EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, New York, NY, USA, 21–26 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 653–656.
29. Guo, X.; Polania, L.; Zhu, B.; Boncelet, C.; Barner, K. Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020.
30. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
31. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
32. Lee, D.-H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. Workshop Chall. Represent. Learn. ICML 2013, 3, 896.
33. Hao, F.; Ma, Z.-F.; Tian, H.-P.; Wang, H.; Wu, D. Semi-supervised label propagation for multi-source remote sensing image change detection. Comput. Geosci. 2022, 170, 105249.
34. Chin, T.-J.; Wang, L.; Schindler, K.; Suter, D. Extrapolating Learned Manifolds for Human Activity Recognition. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16–19 September 2007; pp. 1–381.
35. Blum, A.; Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
36. Chen, C.; Wu, Z.; Jiang, Y.G. Emotion in Context: Deep Semantic Feature Fusion for Video Emotion Recognition. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 127–131.
37. Batbaatar, E.; Li, M.; Ryu, K.H. Semantic-Emotion Neural Network for Emotion Recognition from Text. IEEE Access 2019, 7, 111866–111878.
38. Abbas, A.; Chalup, S.K. Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 561–568.
39. Fujii, K.; Sugimura, D.; Hamamoto, T. Hierarchical Group-Level Emotion Recognition in the Wild. In Proceedings of the 14th IEEE International Conference on Automatic Face & Gesture Recognition, Lille, France, 14–18 May 2019; pp. 1–5.
40. Quach, K.G.; Le, N.; Duong, C.N.; Jalata, I.; Roy, K.; Luu, K. Non-Volume preserving-based fusion to group-level emotion recognition on crowd videos. Pattern Recognit. 2022, 128, 108646.
41. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1502.

Dataset | Split | Positive | Neutral | Negative | Total
---|---|---|---|---|---
GAF2 | train | 1272 | 1199 | 1159 | 3630
GAF2 | val | 773 | 728 | 564 | 2065
GAF2 | test | - | - | - | -
GAF3 | train | 3977 | 3080 | 2758 | 9815
GAF3 | val | 1747 | 1368 | 1231 | 4346
GAF3 | test | - | - | - | -
GroupEmoW | train | 4645 | 3463 | 3019 | 11,127
GroupEmoW | val | 1327 | 990 | 861 | 3178
GroupEmoW | test | 664 | 494 | 431 | 1589
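
As a quick check on how balanced the three classes are, the training-split counts in the table above can be turned into per-class shares. This is only an illustrative sketch: the names `train_counts` and `class_shares` are hypothetical, and the calculation is plain arithmetic on the tabulated counts rather than anything the paper's method prescribes.

```python
# Training-split image counts copied from the dataset table above.
train_counts = {
    "GAF2": {"Positive": 1272, "Neutral": 1199, "Negative": 1159},
    "GAF3": {"Positive": 3977, "Neutral": 3080, "Negative": 2758},
    "GroupEmoW": {"Positive": 4645, "Neutral": 3463, "Negative": 3019},
}


def class_shares(counts):
    """Return each class's share of the split in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Hypothetical helper output: one dict of percentages per dataset.
shares = {name: class_shares(counts) for name, counts in train_counts.items()}

for name, per_class in shares.items():
    print(name, per_class)
```

In every training split the Positive class is the largest and the Negative class the smallest.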

Semi-Supervised GER on the GAF2 Dataset

Label Rate of Training Set | Method | Positive | Neutral | Negative | Overall
---|---|---|---|---|---
5% | ResNet-50 | 56.79 | 46.97 | 83.21 | 60.43
5% | SSGER | 79.43 | 72.07 | 72.14 | 74.90
10% | ResNet-50 | 53.82 | 64.18 | 79.15 | 64.23
10% | SSGER | 83.05 | 69.68 | 73.25 | 75.74
30% | ResNet-50 | 76.33 | 58.39 | 76.94 | 70.21
30% | SSGER | 80.98 | 75.04 | 72.88 | 76.73
100% | ResNet-50 | 76.46 | 61.21 | 82.29 | 72.68
100% | Surace + [16] | 68.61 | 59.63 | 76.05 | 67.75
100% | Abbas + [38] | 79.76 | 66.20 | 69.97 | 71.98
100% | Fujii + [39] | 75.68 | 69.64 | 77.33 | 74.22
100% | Fujii + [17] | 78.01 | 72.92 | 76.48 | 75.81
100% | SSGER | 85.38 | 84.49 | 60.89 | 78.51

Semi-Supervised GER on the GAF3 Dataset

Label Rate of Training Set | Method | Positive | Neutral | Negative | Overall
---|---|---|---|---|---
5% | ResNet-50 | 80.54 | 64.99 | 43.62 | 65.19
5% | SSGER | 83.40 | 63.38 | 66.69 | 72.37
10% | ResNet-50 | 72.70 | 72.81 | 52.72 | 67.07
10% | SSGER | 82.60 | 71.64 | 64.66 | 74.07
30% | ResNet-50 | 85.46 | 62.14 | 58.81 | 70.57
30% | SSGER | 83.23 | 66.81 | 72.38 | 74.99
100% | ResNet-50 | 86.89 | 61.62 | 67.75 | 73.52
100% | Fujii + [39] | 72.12 | 69.51 | 71.52 | 71.05
100% | Quach + [40] | - | - | - | 74.18
100% | Fujii + [17] | 78.42 | 71.19 | 73.40 | 74.34
100% | SSGER | 79.85 | 76.61 | 73.44 | 77.01

Semi-Supervised GER on the GroupEmoW Dataset

Label Rate of Training Set | Method | Positive | Neutral | Negative | Overall
---|---|---|---|---|---
5% | ResNet-50 | 81.33 | 84.62 | 65.89 | 78.16
5% | SSGER | 89.31 | 83.81 | 81.67 | 85.53
10% | ResNet-50 | 85.84 | 82.39 | 74.25 | 81.62
10% | SSGER | 92.62 | 79.96 | 84.92 | 86.60
30% | ResNet-50 | 87.65 | 82.39 | 79.12 | 83.70
30% | SSGER | 93.52 | 80.36 | 84.92 | 87.10
100% | ResNet-50 | 91.87 | 83.81 | 77.96 | 85.59
100% | Khan + [12] | - | - | - | 89.36
100% | SSGER | 94.13 | 85.22 | 84.22 | 88.67

Comparison on Overall Accuracies (in %)

Label Rate | Method | GAF2 | GAF3 | GroupEmoW
---|---|---|---|---
5% | SSGER (w/o Contrastive Learning) | 69.91 | 68.74 | 82.63
5% | SSGER (w/o Pseudo-Label) | 74.21 | 70.32 | 84.71
5% | SSGER (w/o WCE-Loss) | 74.72 | 72.14 | 85.27
5% | SSGER | 74.90 | 72.37 | 85.53
10% | SSGER (w/o Contrastive Learning) | 70.26 | 71.58 | 84.58
10% | SSGER (w/o Pseudo-Label) | 73.72 | 72.60 | 85.53
10% | SSGER (w/o WCE-Loss) | 75.59 | 73.72 | 86.09
10% | SSGER | 75.74 | 74.07 | 86.60
30% | SSGER (w/o Contrastive Learning) | 74.11 | 72.02 | 85.90
30% | SSGER (w/o Pseudo-Label) | 75.69 | 74.25 | 86.47
30% | SSGER (w/o WCE-Loss) | 76.68 | 74.74 | 87.04
30% | SSGER | 76.73 | 74.99 | 87.10
100% | SSGER (w/o Contrastive Learning) | 76.14 | 75.41 | 87.85
100% | SSGER | 78.51 | 77.01 | 88.67
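
To read the ablation numbers at a glance, one can subtract the "w/o Contrastive Learning" row from the full SSGER row at each label rate. The sketch below does only that arithmetic on the values tabulated above; the names `ablation`, `datasets`, and `contrastive_gain` are hypothetical and not taken from the paper.

```python
# Overall accuracies (%) copied from the ablation table above.
# Each tuple is ordered as (GAF2, GAF3, GroupEmoW).
datasets = ("GAF2", "GAF3", "GroupEmoW")
ablation = {
    "5%": {"without": (69.91, 68.74, 82.63), "full": (74.90, 72.37, 85.53)},
    "10%": {"without": (70.26, 71.58, 84.58), "full": (75.74, 74.07, 86.60)},
    "30%": {"without": (74.11, 72.02, 85.90), "full": (76.73, 74.99, 87.10)},
    "100%": {"without": (76.14, 75.41, 87.85), "full": (78.51, 77.01, 88.67)},
}


def contrastive_gain(rows):
    """Percentage-point gain of full SSGER over the variant without contrastive learning."""
    return {
        name: round(full - without, 2)
        for name, without, full in zip(datasets, rows["without"], rows["full"])
    }


gains = {rate: contrastive_gain(rows) for rate, rows in ablation.items()}

for rate, per_dataset in gains.items():
    print(rate, per_dataset)
```

On these tabulated values the difference is positive for every label rate and every dataset.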
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, J.; Wang, X.; Zhang, D.; Lee, D.-J. Semi-Supervised Group Emotion Recognition Based on Contrastive Learning. Electronics 2022, 11, 3990. https://doi.org/10.3390/electronics11233990