Multi-Modal Sarcasm Detection with Sentiment Word Embedding
Abstract
1. Introduction
- A pioneering model is presented that uses an external sentiment lexicon [12] to score the segmented units of each modality and merges the resulting sentiment vectors for multi-modal sarcasm detection (a minimal sketch of this scoring step follows this list);
- A mechanism is proposed for fusing external knowledge with all modalities to reduce noise in inconsistency measurement;
- Extensive comparative experiments are conducted against baseline models on the Twitter dataset; our model outperforms all of them, achieving state-of-the-art results.
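To make the first contribution concrete, the following is a minimal sketch of lexicon-based segment scoring, using NLTK's VADER lexicon as a stand-in for the external sentiment lexicon [12]; the function name `sentiment_vector` and the mean-pooling scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of lexicon-based sentiment scoring for a text segment.
# NLTK's VADER stands in for the external sentiment lexicon used in the
# paper; the pooling (mean over tokens) is an assumption for illustration.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

def sentiment_vector(tokens):
    """Score each token and return a segment-level sentiment vector
    (mean pos/neu/neg/compound over tokens)."""
    scores = [analyzer.polarity_scores(tok) for tok in tokens]
    return [sum(s[k] for s in scores) / len(scores)
            for k in ("pos", "neu", "neg", "compound")]

# A sarcastic caption often scores positive on the surface; the model
# relies on mismatches across modalities to flag sarcasm.
print(sentiment_vector("what a totally wonderful delay".split()))
```

In the full model, such segment-level sentiment vectors would be merged with the textual and visual representations before the attention-based fusion of Section 2.5.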
2. Materials and Methods
2.1. Related Work
2.2. Methods
2.3. Text Image Modality and Commonsense Representation
2.4. Sentiment Word Embedding (SWE)
2.5. Multi-Head Attention Fusion Module
3. Results
3.1. Experimental Settings
3.2. Comparison Models
- (1) Image-modality methods: These models leverage solely visual information for sarcasm detection. For example, Image employs ResNet to train a sarcasm classifier, while ViT (Dosovitskiy et al. [24]) uses a Vision Transformer to identify sarcasm, specifically through the “[class]” token representations;
- (2) Text-modality methods: These models rely exclusively on textual information, encompassing TextCNN (Kim [32]), a CNN-based deep learning model for text classification; Bi-LSTM, a bidirectional LSTM network for text classification; SIARN (Tay et al. [2]), which employs inner attention to detect sarcasm in the text; SMSD (Xiong et al. [33]), which utilizes self-matching networks to capture textual inconsistencies; and BERT (Devlin et al. [34]), which accepts “[CLS] text [SEP]” as input for pre-training;
- (3) Multi-modal methods: These models utilize both textual and visual information to detect sarcasm. For instance, Cai et al. [31] proposed HFM, a hierarchical multi-modal feature fusion model for multi-modal sarcasm detection; Net D&R (Xu et al. [35]) employs a decomposition and relation network for cross-modal modeling of modal contrast and semantic association; Res-BERT (Pan et al. [17]) concatenates image features and BERT-based text features to predict sarcasm; Att-BERT (Pan et al. [17]) explores inter-modal attention and co-attention for modeling incongruity in multi-modal sarcasm detection; and InCrossMGs (Liang et al. [6]) is a graph-based approach that harnesses sarcasm relations from intra- and inter-modal perspectives (an illustrative sketch of the concatenation-style fusion follows this list).
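For concreteness, here is an illustrative PyTorch sketch of a Res-BERT-style baseline: the BERT “[CLS]” text feature is concatenated with a pooled ResNet-50 image feature and passed to a linear classifier. The class name, layer choices, and dimensions are assumptions for illustration, not the code of Pan et al. [17].

```python
# Illustrative sketch of a Res-BERT-style baseline: concatenate the BERT
# "[CLS]" text feature with a pooled ResNet-50 image feature, then apply
# a linear classifier. Dimensions follow bert-base (768) and ResNet-50
# (2048); all names here are illustrative, not the original authors' code.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class ResBertSketch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to and including global average pooling.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.classifier = nn.Linear(768 + 2048, num_classes)

    def forward(self, input_ids, attention_mask, images):
        # "[CLS]" token representation as the sentence-level text feature.
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        img = self.cnn(images).flatten(1)  # (batch, 2048)
        return self.classifier(torch.cat([cls, img], dim=-1))
```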
3.3. Main Results
3.4. Additional Dataset Experiments
4. Discussion
4.1. Ablation Study
4.2. Multi-Modal Experiments Analysis
4.3. Case Study
4.4. Error Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| NLTK | Natural Language Toolkit |
| ViT | Vision Transformer |
| LSTM | Long Short-Term Memory |
| Bi-LSTM | Bidirectional Long Short-Term Memory |
| SWE | Sentiment Word Embedding |
References
- Gibbs, R.W. On the psycholinguistics of sarcasm. J. Exp. Psychol. Gen. 1986, 115, 3.
- Tay, Y.; Tuan, L.A.; Hui, S.C.; Su, J. Reasoning with sarcasm by reading in-between. arXiv 2018, arXiv:1805.02856.
- Gupta, S.; Shah, A.; Shah, M.; Syiemlieh, L.; Maurya, C. FiLMing Multimodal Sarcasm Detection with Attention. In Proceedings of the Neural Information Processing: 28th International Conference, ICONIP 2021, Bali, Indonesia, 8–12 December 2021; Proceedings, Part V 28. Springer: Berlin/Heidelberg, Germany, 2021; pp. 178–186.
- Yao, F.; Sun, X.; Yu, H.; Zhang, W.; Liang, W.; Fu, K. Mimicking the brain’s cognition of sarcasm from multidisciplines for Twitter sarcasm detection. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 228–242.
- Wen, Z.; Wang, R.; Luo, X.; Wang, Q.; Liang, B.; Du, J.; Yu, X.; Gui, L.; Xu, R. Multi-perspective contrastive learning framework guided by sememe knowledge and label information for sarcasm detection. Int. J. Mach. Learn. Cybern. 2023, 14, 4119–4134.
- Liang, B.; Lou, C.; Li, X.; Gui, L.; Yang, M.; Xu, R. Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4707–4715.
- Jiang, D.; Liu, H.; Tu, G.; Wei, R. Window transformer for dialogue document: A joint framework for causal emotion entailment. Int. J. Mach. Learn. Cybern. 2023, 14, 2697–2707.
- Qin, L.; Huang, S.; Chen, Q.; Cai, C.; Zhang, Y.; Liang, B.; Che, W.; Xu, R. MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System. arXiv 2023, arXiv:2307.07135.
- Zhao, W.; Zhao, Y.; Li, Z.; Qin, B. Knowledge-bridged causal interaction network for causal emotion entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14020–14028.
- Loper, E.; Bird, S. NLTK: The Natural Language Toolkit. arXiv 2002, arXiv:cs/0205028.
- Cambria, E.; Hussain, A. SenticNet. In Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis; Springer: Berlin/Heidelberg, Germany, 2015; pp. 23–71.
- Liang, B.; Li, X.; Gui, L.; Fu, Y.; He, Y.; Yang, M.; Xu, R. Few-shot aspect category sentiment analysis via meta-learning. ACM Trans. Inf. Syst. 2023, 41, 1–31.
- Liu, H.; Wang, W.; Li, H. Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement. arXiv 2022, arXiv:2210.03501.
- Cai, C.; Zhao, Q.; Xu, R.; Qin, B. Multimodal Dialogue Understanding via Holistic Modeling and Sequence Labeling. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing, Foshan, China, 12–15 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 399–411.
- Li, J.; Pan, H.; Lin, Z.; Fu, P.; Wang, W. Sarcasm detection with commonsense knowledge. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3192–3201.
- Veale, T.; Hao, Y. Detecting ironic intent in creative comparisons. In Proceedings of the ECAI 2010, Lisbon, Portugal, 16–20 August 2010; IOS Press: Amsterdam, The Netherlands, 2010; pp. 765–770.
- Pan, H.; Lin, Z.; Fu, P.; Qi, Y.; Wang, W. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 1383–1392.
- Liang, B.; Lou, C.; Li, X.; Yang, M.; Gui, L.; He, Y.; Pei, W.; Xu, R. Multi-modal sarcasm detection via cross-modal graph convolutional network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 1767–1777.
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A Survey on Multimodal Large Language Models. arXiv 2023, arXiv:2306.13549.
- Li, Y.; Zhang, Y.; Yang, Y.; Xu, R. A Generative Model for Structured Sentiment Analysis. In Proceedings of the International Conference on AI and Mobile Services, Hawaii, HI, USA, 23 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 28–38.
- Zhao, S.; Jiang, H.; Tao, H.; Zha, R.; Zhang, K.; Xu, T.; Chen, E. PEDM: A Multi-task Learning Model for Persona-aware Emoji-embedded Dialogue Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–21.
- Lu, X.; Zhao, W.; Zhao, Y.; Qin, B.; Zhang, Z.; Wen, J. A Topic-Enhanced Approach for Emotion Distribution Forecasting in Conversations. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Tu, G.; Liang, B.; Jiang, D.; Xu, R. Sentiment-, Emotion-, and Context-guided Knowledge Selection Framework for Emotion Recognition in Conversations. IEEE Trans. Affect. Comput. 2022, 14, 1803–1816.
- Chen, M.; Lu, X.; Xu, T.; Li, Y.; Zhou, J.; Dou, D.; Xiong, H. Towards table-to-text generation with pretrained language model: A table structure understanding and text deliberating approach. arXiv 2023, arXiv:2301.02071.
- Wu, Y.; Zhao, Y.; Yang, H.; Chen, S.; Qin, B.; Cao, X.; Zhao, W. Sentiment word aware multimodal refinement for multimodal sentiment analysis with ASR errors. arXiv 2022, arXiv:2203.00257.
- Wen, J.; Jiang, D.; Tu, G.; Liu, C.; Cambria, E. Dynamic interactive multiview memory network for emotion recognition in conversation. Inf. Fusion 2023, 91, 123–133.
- Jiang, D.; Liu, H.; Wei, R.; Tu, G. CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition. Cogn. Comput. 2023, 15, 1082–1091.
- Jiang, D.; Wei, R.; Liu, H.; Wen, J.; Tu, G.; Zheng, L.; Cambria, E. A Multitask Learning Framework for Multimodal Sentiment Analysis. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Virtual, 7–10 December 2021; pp. 151–157.
- Cai, Y.; Cai, H.; Wan, X. Multi-modal sarcasm detection in Twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2506–2515.
- Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015.
- Xiong, T.; Zhang, P.; Zhu, H.; Yang, Y. Sarcasm detection with self-matching networks and low-rank bilinear pooling. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2115–2124.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Xu, N.; Zeng, Z.; Mao, W. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3777–3786.
- Maity, K.; Jha, P.; Saha, S.; Bhattacharyya, P. A multitask framework for sentiment, emotion and sarcasm aware cyberbullying detection from multi-modal code-mixed memes. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1739–1749.
- Zhang, M.; Zhang, Y.; Fu, G. Tweet sarcasm detection using deep neural network. In Proceedings of the COLING 2016, The 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2449–2460.
- Babanejad, N.; Davoudi, H.; An, A.; Papagelis, M. Affective and contextual embedding for sarcasm detection. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 225–243.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
| Split | Non-Sarcasm | Sarcasm | Total |
|---|---|---|---|
| Training set | 8,642 | 11,174 | 19,816 |
| Validation set | 959 | 1,451 | 2,410 |
| Test set | 959 | 1,450 | 2,409 |
| Modality | Method | Acc (%) | Pre (%) | Rec (%) | F1 (%) |
|---|---|---|---|---|---|
| image | Image [31] | 64.76 | 54.41 | 70.80 | 61.53 |
| image | ViT [24] | 67.83 | 57.93 | 70.07 | 63.43 |
| text | TextCNN [32] | 80.03 | 74.29 | 76.39 | 75.32 |
| text | Bi-LSTM | 81.89 | 76.64 | 78.40 | 77.50 |
| text | SIARN [2] | 80.55 | 75.56 | 75.68 | 75.61 |
| text | SMSD [33] | 80.88 | 76.48 | 75.16 | 75.80 |
| text | BERT [34] | 83.83 | 78.70 | 82.26 | 80.21 |
| image + text | HFM [31] | 83.39 | 76.54 | 84.17 | 80.16 |
| image + text | Net D&R [35] | 84.01 | 77.95 | 83.39 | 80.58 |
| image + text | Res-BERT [17] | 84.79 | 77.78 | 84.12 | 80.86 |
| image + text | Att-BERT [17] | 86.04 | 78.61 | 83.28 | 80.92 |
| image + text | InCrossMGs [6] | 86.08 | 81.36 | 84.34 | 82.80 |
| image + text | Ours | 86.08 | 82.11 | 84.77 | 83.42 |
| Modality | Model | Acc (%) | Pre (%) | Rec (%) | F1 (%) | Macro-F1 (%) |
|---|---|---|---|---|---|---|
| Text | TextCNN | 54.22 | 62.14 | 41.17 | 49.53 | 53.92 |
| Text | TextCNN-LSTM | 57.75 | 59.14 | 53.83 | 56.36 | 57.70 |
| Image | ResNet | 55.66 | 45.53 | 66.97 | 54.21 | 54.57 |
| Text + Image | HFM | 55.43 | 58.50 | 60.97 | 59.71 | 54.92 |
| Text + Image | Maity [36] | 58.70 | 64.61 | 56.37 | 60.12 | 58.64 |
| Text + Image | Ours | 59.51 | 61.67 | 63.03 | 62.34 | 59.31 |
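For reference, the Acc/Pre/Rec/F1 and Macro-F1 columns in the tables above follow the standard definitions, computed here with scikit-learn on toy labels; treating sarcasm as the positive class is an assumption about the evaluation protocol, not a detail taken from the authors' scripts.

```python
# Standard metric definitions behind the reported columns (assumed setup:
# binary labels with sarcasm = 1 as the positive class; toy data only).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # toy gold labels
y_pred = [1, 0, 1, 0, 0, 1]  # toy predictions

print(f"Acc      {accuracy_score(y_true, y_pred):.2%}")
print(f"Pre      {precision_score(y_true, y_pred):.2%}")
print(f"Rec      {recall_score(y_true, y_pred):.2%}")
print(f"F1       {f1_score(y_true, y_pred):.2%}")
print(f"Macro-F1 {f1_score(y_true, y_pred, average='macro'):.2%}")
```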
| Model | Acc (%) | Pre (%) | Rec (%) | F1 (%) |
|---|---|---|---|---|
| Ours | 86.08 | 82.11 | 84.77 | 83.42 |
| fusion | 80.35 | 79.63 | 80.65 | 80.14 |
| emotion-fusion | 81.32 | 80.52 | 82.42 | 81.46 |
| emotion | 81.68 | 79.88 | 84.67 | 82.21 |
| Modality | Acc (%) | Pre (%) | Rec (%) | F1 (%) |
|---|---|---|---|---|
| Text | 83.36 | 80.09 | 79.46 | 79.78 |
| Image | 72.44 | 67.24 | 64.93 | 66.06 |
| Text + Image | 86.08 | 82.11 | 84.77 | 83.42 |