Nemesis: Neural Mean Teacher Learning-Based Emotion-Centric Speaker
Abstract
1. Introduction
- Auxiliary text-to-emotion and image-to-emotion classification tasks;
- A visual encoder to extract visual features from the input image;
- Two interconnected language models following a transformer-based encoder/decoder architecture.
- A novel approach to the image-to-emotion classification task that reduces the classifier's texture bias and encourages shape-based classification, motivated by the fact that the local textures of our input images (artworks) differ from those of real-world photographs;
- Achieving state-of-the-art performance with Nemesis on the ArtEmis dataset. We suggest that a self-critical mean teacher learning-based approach, supervised by extra emotional signals, is a promising path towards generating more human-like, emotionally rich captions (a minimal sketch of the mean teacher update follows this list).
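At the core of the mean teacher strategy is a teacher network whose weights track an exponential moving average (EMA) of the student's weights, so the teacher acts as a temporally smoothed ensemble that supervises the student. The PyTorch sketch below illustrates this update only; the decay value `alpha = 0.999`, the helper names, and the loss terms hinted at in the comments are illustrative assumptions rather than the exact Nemesis configuration.

```python
# Minimal sketch of a mean-teacher (EMA) weight update, assuming a generic
# PyTorch student/teacher pair; this is not the authors' exact implementation.
import copy
import torch


def build_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """The teacher starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


@torch.no_grad()
def ema_update(teacher: torch.nn.Module,
               student: torch.nn.Module,
               alpha: float = 0.999) -> None:
    """Move each teacher parameter toward the corresponding student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # teacher <- alpha * teacher + (1 - alpha) * student
        t_param.mul_(alpha).add_(s_param.detach(), alpha=1.0 - alpha)


# Assumed training-loop usage (names `xe_loss`, `kd_loss`, `optimizer` are hypothetical):
#   loss = xe_loss(student_logits, captions) + kd_loss(student_logits, teacher_logits)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   ema_update(teacher, student)   # teacher lags behind the student as a smoothed ensemble
```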
1.1. Related Work
1.1.1. Mean Teacher Learning
1.1.2. Knowledge Distillation
1.1.3. Self-Critical Sequence Training
1.1.4. Visual Encoding
1.1.5. Auxiliary Emotion Classification Tasks
2. Materials and Methods
2.1. Pipeline
Emotional Grounding
2.2. Architecture
2.2.1. Memory-Augmented Encoder
2.2.2. Meshed Decoder
2.3. Training Strategy
2.3.1. Cross-Entropy (XE) Training
2.3.2. SCST Fine-Tuning
2.4. Dataset
3. Results
3.1. Metrics and Implementation Details
3.2. Ablation Study
3.2.1. Visual Encoder
3.2.2. SCST Fine-Tuning
3.2.3. Image-to-Emotion Classifier
3.2.4. Emotional Grounding
3.3. Comparison with the State-of-the-Art
3.3.1. Auxiliary Classification
3.3.2. Emotion-Centric Image Captioning Task
3.3.3. Limitations
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Teacher model results. B-1–B-4: BLEU-1–4; M: METEOR; R: ROUGE-L; C: CIDEr.

| Model | Visual Encoder | B-1 | B-2 | B-3 | B-4 | M | R | C |
|---|---|---|---|---|---|---|---|---|
| Nemesis | Faster R-CNN | 0.503 | 0.277 | 0.154 | 0.089 | 0.141 | 0.278 | 0.093 |
| Nemesis | CLIP-RN50 × 16 | 0.539 | 0.311 | 0.178 | 0.106 | 0.141 | 0.294 | 0.130 |
| Nemesis | | 0.526 | 0.304 | 0.175 | 0.105 | 0.138 | 0.291 | 0.127 |
| EGNemesis | Faster R-CNN | 0.458 | 0.233 | 0.121 | 0.066 | 0.118 | 0.242 | 0.070 |
| EGNemesis | CLIP-RN50 × 16 | 0.475 | 0.252 | 0.136 | 0.076 | 0.124 | 0.254 | 0.095 |
| EGNemesis | | 0.470 | 0.252 | 0.137 | 0.077 | 0.123 | 0.255 | 0.099 |
Student model results.

| Model | Visual Encoder | B-1 | B-2 | B-3 | B-4 | M | R | C |
|---|---|---|---|---|---|---|---|---|
| Nemesis | Faster R-CNN | 0.498 | 0.273 | 0.151 | 0.086 | 0.130 | 0.276 | 0.087 |
| Nemesis | CLIP-RN50 × 16 | 0.532 | 0.304 | 0.172 | 0.102 | 0.137 | 0.290 | 0.120 |
| Nemesis | | 0.509 | 0.290 | 0.165 | 0.097 | 0.137 | 0.281 | 0.116 |
| EGNemesis | Faster R-CNN | 0.455 | 0.233 | 0.122 | 0.066 | 0.114 | 0.243 | 0.066 |
| EGNemesis | CLIP-RN50 × 16 | 0.472 | 0.251 | 0.134 | 0.076 | 0.124 | 0.254 | 0.095 |
| EGNemesis | | 0.479 | 0.260 | 0.141 | 0.080 | 0.129 | 0.262 | 0.099 |
| Training Phase | Visual Encoder | Encoding Mode | Data Parallelism | GPU Type | Time Per Epoch |
|---|---|---|---|---|---|
| XE | Faster R-CNN | Offline | - | NVIDIA P100 | 1 h |
| XE | CLIP-RN50 × 16 | Online | - | NVIDIA V100 | 4 h |
| XE | | Online | ✓ | NVIDIA V100 | 1 h |
| SCST | Faster R-CNN | - | - | - | - |
| SCST | CLIP-RN50 × 16 | Online | - | NVIDIA V100 | 7 h |
| SCST | | Online | - | NVIDIA V100 | 7 h |
| Metric | | | | |
|---|---|---|---|---|
| BLEU-1 | 0.539 | 0.711 | 0.479 | 0.700 |
| BLEU-2 | 0.311 | 0.406 | 0.260 | 0.403 |
| BLEU-3 | 0.178 | 0.211 | 0.141 | 0.214 |
| BLEU-4 | 0.106 | 0.113 | 0.080 | 0.115 |
| METEOR | 0.141 | 0.166 | 0.129 | 0.165 |
| ROUGE-L | 0.294 | 0.341 | 0.262 | 0.336 |
| CIDEr | 0.130 | 0.219 | 0.099 | 0.224 |
| Metric | | |
|---|---|---|
| BLEU-1 | 0.466 | 0.479 |
| BLEU-2 | 0.251 | 0.260 |
| BLEU-3 | 0.137 | 0.141 |
| BLEU-4 | 0.077 | 0.080 |
| METEOR | 0.128 | 0.129 |
| ROUGE-L | 0.253 | 0.262 |
| CIDEr | 0.093 | 0.099 |
| Metric | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| BLEU-1 | 0.536 | 0.520 | 0.507 | 0.511 | 0.539 | 0.711 | 0.479 | 0.700 |
| BLEU-2 | 0.290 | 0.280 | 0.282 | 0.282 | 0.311 | 0.406 | 0.260 | 0.403 |
| BLEU-3 | 0.155 | 0.146 | 0.159 | 0.154 | 0.178 | 0.211 | 0.141 | 0.241 |
| BLEU-4 | 0.087 | 0.079 | 0.095 | 0.090 | 0.106 | 0.113 | 0.080 | 0.115 |
| METEOR | 0.142 | 0.134 | 0.140 | 0.137 | 0.141 | 0.166 | 0.129 | 0.165 |
| ROUGE-L | 0.297 | 0.294 | 0.280 | 0.286 | 0.294 | 0.341 | 0.262 | 0.336 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yousefi, A.; Passi, K. Nemesis: Neural Mean Teacher Learning-Based Emotion-Centric Speaker. Algorithms 2023, 16, 97. https://doi.org/10.3390/a16020097