Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion
Abstract
1. Introduction
- We propose a CNN- and Transformer-based approach for multimodal sentiment analysis that extracts more comprehensive text and image features: a CNN captures local features, a Transformer captures global dependencies, and the two views are combined into a richer joint representation (a minimal fusion sketch follows this list).
- We employ supervised contrastive learning with data augmentation to improve performance. The supervised contrastive loss pulls embedding vectors of the same class closer together and pushes those of different classes apart, which better captures intra-class similarity and improves the robustness of our method.
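To make the local–global idea concrete, the following is a minimal PyTorch sketch of combining CNN-extracted local features with Transformer-extracted global features. The module name `LocalGlobalFusion`, layer sizes, and the mean-pooling step are illustrative assumptions, not the paper's exact MLFC configuration.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Illustrative sketch: a 1-D CNN captures local patterns over token/patch
    embeddings, a Transformer encoder captures global dependencies, and the
    two views are concatenated and projected into one fused representation.
    Hyperparameters are placeholders, not the paper's exact settings."""

    def __init__(self, dim=768, kernel_size=3, n_heads=8, n_layers=2):
        super().__init__()
        self.local_conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                                   batch_first=True)
        self.global_enc = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, tokens):                                            # (B, L, dim)
        local = self.local_conv(tokens.transpose(1, 2)).transpose(1, 2)   # (B, L, dim)
        global_ = self.global_enc(tokens)                                 # (B, L, dim)
        fused = torch.cat([local, global_], dim=-1)                       # (B, L, 2*dim)
        return self.proj(fused).mean(dim=1)                               # (B, dim)


# Usage: fuse a batch of token or patch embeddings (e.g., from BERT or a ResNet feature map).
fusion = LocalGlobalFusion()
features = torch.randn(4, 50, 768)
print(fusion(features).shape)  # torch.Size([4, 768])
```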
2. Related Work
2.1. Multimodal Learning
2.2. Multimodal Sentiment Analysis
3. Materials and Methods
3.1. Representation Learning Framework
3.1.1. Data Augmentation
3.1.2. Encoder Network
3.1.3. Multi-Layer Fusion Convolution Neural Network
3.2. Sentiment Classification Contrastive Loss
3.2.1. Supervised Contrastive Losses
3.2.2. Sentiment Classification Loss
3.2.3. Sentiment Classification Supervised Contrastive Loss
- Within a batch, samples that share the anchor's label, together with their augmented copies, form the positive set in the numerator of the SupConLoss formula. The supervised contrastive loss encourages the encoder to produce closer representations for samples of the same class, so that same-class samples cluster more tightly in the embedding space [46].
- SCSupCon optimizes the separation distance on the normalized hypersphere using the angle–arc relationship. As a result, SCSupCon can produce a more pronounced separation between the nearest classes on the hypersphere's surface (see the loss sketch after this list).
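The following is a minimal PyTorch sketch of a supervised contrastive loss in this spirit: same-label samples in the batch (including augmented copies) form the numerator's positive set, and an additive angular margin is applied to positive pairs on the unit hypersphere. The function name, margin placement, and temperature are illustrative assumptions, not the paper's exact SCSupConLoss formulation.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings, labels, temperature=0.07, margin=0.0):
    """Sketch of a supervised contrastive loss with an optional additive
    angular margin on positive pairs (an illustrative stand-in for the
    angle-arc separation described for SCSupCon)."""
    z = F.normalize(embeddings, dim=1)                    # project onto unit hypersphere
    cos = torch.clamp(z @ z.t(), -1 + 1e-7, 1 - 1e-7)     # pairwise cosine similarities
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    self_mask = torch.eye(len(labels), device=z.device)
    pos_mask = pos_mask - self_mask                       # exclude self-pairs

    # Positive pairs get cos(theta + m); negatives keep plain cosine similarity.
    theta = torch.acos(cos)
    logits = torch.where(pos_mask.bool(), torch.cos(theta + margin), cos) / temperature
    logits = logits.masked_fill(self_mask.bool(), float('-inf'))

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=1) / n_pos      # average over each anchor's positives
    return loss.mean()


# Usage: original and augmented samples share their class labels.
emb = torch.randn(8, 128, requires_grad=True)
lbl = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
print(sup_con_loss(emb, lbl, margin=0.1))
```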
4. Experiment
4.1. Data Processing
4.2. Comparative Experiments
4.3. Ablation Experiments
4.4. Data Visualization
4.5. Case Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kaur, R.; Kautish, S. Multimodal sentiment analysis: A survey and comparison. In Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines; IGI Global: Hershey, PA, USA, 2022; pp. 1846–1870.
- Balazs, J.A.; Velásquez, J.D. Opinion mining and information fusion: A survey. Inf. Fusion 2016, 27, 95–110.
- Ke, Z.; Sheng, J.; Li, Z.; Silamu, W.; Guo, Q. Knowledge-guided sentiment analysis via learning from natural language explanations. IEEE Access 2021, 9, 3570–3578.
- Ma, D.; Li, S.; Zhang, X.; Wang, H. Interactive attention networks for aspect-level sentiment classification. arXiv 2017, arXiv:1709.00893.
- Xu, Q.; Zhu, L.; Dai, T.; Yan, C. Aspect-based sentiment classification with multi-attention network. Neurocomputing 2020, 388, 135–143.
- Jindal, S.; Singh, S. Image sentiment analysis using deep convolutional neural networks with domain specific fine tuning. In Proceedings of the 2015 International Conference on Information Processing (ICIP), Pune, India, 16–19 December 2015; pp. 447–451.
- Yang, J.; She, D.; Sun, M.; Cheng, M.M.; Rosin, P.L.; Wang, L. Visual sentiment prediction based on automatic discovery of affective regions. IEEE Trans. Multimed. 2018, 20, 2513–2525.
- Xu, N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 152–154.
- Xu, N.; Mao, W. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2399–2402.
- Yu, Y.; Lin, H.; Meng, J.; Zhao, Z. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms 2016, 9, 41.
- You, Q.; Luo, J.; Jin, H.; Yang, J. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, 22–25 February 2016; pp. 13–22.
- Chen, X.; Wang, Y.; Liu, Q. Visual and textual sentiment analysis using deep fusion convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 1557–1561.
- You, Q.; Cao, L.; Jin, H.; Luo, J. Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 1008–1017.
- Yang, X.; Feng, S.; Wang, D.; Zhang, Y. Image-text multimodal emotion classification via multi-view attentional network. IEEE Trans. Multimed. 2020, 23, 4014–4026.
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2641–2649.
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913.
- Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246.
- Niu, T.; Zhu, S.; Pang, L.; Saddik, A.E. Sentiment analysis on multi-view social data. In Proceedings of the International Conference on Multimedia Modeling; Springer: Cham, Switzerland, 2016; pp. 15–27.
- Cai, Y.; Cai, H.; Wan, X. Multi-modal sarcasm detection in Twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2506–2515.
- Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the ICML, Bellevue, WA, USA, 28 June–2 July 2011.
- Yu, Y.; Tang, S.; Aizawa, K.; Aizawa, A. Category-based deep CCA for fine-grained venue discovery from multimodal data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1250–1258.
- Song, J.; Guo, Y.; Gao, L.; Li, X.; Hanjalic, A.; Shen, H.T. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3047–3058.
- Morency, L.P.; Mihalcea, R.; Doshi, P. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, Alicante, Spain, 14–18 November 2011; pp. 169–176.
- Valada, A.; Oliveira, G.L.; Brox, T.; Burgard, W. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In Proceedings of the International Symposium on Experimental Robotics, Nagasaki, Japan, 3–8 October 2016; Springer: Cham, Switzerland, 2016; pp. 465–477.
- Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
- Xi, C.; Lu, G.; Yan, J. Multimodal sentiment analysis based on multi-head attention mechanism. In Proceedings of the 4th International Conference on Machine Learning and Soft Computing, Haiphong City, Vietnam, 17–19 January 2020; pp. 34–39.
- Li, Z.; Li, X.; Sheng, J.; Slamu, W. AgglutiFiT: Efficient low-resource agglutinative language model fine-tuning. IEEE Access 2020, 8, 148489–148499.
- Sharma, C.; Bhageria, D.; Scott, W.; Pykl, S.; Das, A.; Chakraborty, T.; Pulabaigari, V.; Gamback, B. SemEval-2020 Task 8: Memotion Analysis—The Visuo-Lingual Metaphor! arXiv 2020, arXiv:2008.03781.
- Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727.
- Zadeh, A.; Cao, Y.S.; Hessner, S.; Liang, P.P.; Poria, S.; Morency, L.P. CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; NIH Public Access: Stroudsburg, PA, USA, 2020; Volume 2020, p. 1801.
- Lopes, V.; Gaspar, A.; Alexandre, L.A.; Cordeiro, J. An AutoML-based approach to multimodal image sentiment analysis. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–9.
- Ramamoorthy, S.; Gunti, N.; Mishra, S.; Suryavardan, S.; Reganti, A.; Patwa, P.; Das, A.; Chakraborty, T.; Sheth, A.; Ekbal, A.; et al. Memotion 2: Dataset on sentiment and emotion analysis of memes. In Proceedings of the De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, Vancouver, BC, Canada, 22 February–1 March 2022.
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883.
- Huang, F.; Zhang, X.; Zhao, Z.; Xu, J.; Li, Z. Image–text sentiment analysis via deep multimodal attentive fusion. Knowl.-Based Syst. 2019, 167, 26–37.
- Truong, Q.T.; Lauw, H.W. VistaNet: Visual aspect attention network for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 305–312.
- Xu, J.; Li, Z.; Huang, F.; Li, C.; Philip, S.Y. Social image sentiment analysis by exploiting multimodal content and heterogeneous relations. IEEE Trans. Ind. Inform. 2020, 17, 2974–2982.
- Wu, Y.; Lin, Z.; Zhao, Y.; Qin, B.; Zhu, L.N. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 4730–4738.
- Tan, Y.; Sun, Z.; Duan, F.; Solé-Casals, J.; Caiafa, C.F. A multimodal emotion recognition method based on facial expressions and electroencephalography. Biomed. Signal Process. Control 2021, 70, 103029.
- Ali, F.; Ali, A.; Imran, M.; Naqvi, R.A.; Siddiqi, M.H.; Kwak, K.S. Traffic accident detection and condition analysis based on social networking data. Accid. Anal. Prev. 2021, 151, 105973.
- Li, Z.; Xu, B.; Zhu, C.; Zhao, T. CLMLF: A contrastive learning and multi-layer fusion method for multimodal sentiment detection. arXiv 2022, arXiv:2204.05515.
- Li, Z.; Mak, M.W.; Meng, H.M.L. Discriminative speaker representation via contrastive learning with class-aware attention in angular space. arXiv 2022, arXiv:2210.16622.
- Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; Le, Q. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268.
- Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Sheng, J.; Zhang, Y.; Cai, J.; Lam, S.K.; Li, Z.; Zhang, J.; Teng, X. Multi-view contrastive learning with additive margin for adaptive nasopharyngeal carcinoma radiotherapy prediction. arXiv 2022, arXiv:2210.15201.
- Li, Z.; Mak, M.W. Speaker representation learning via contrastive loss with maximal speaker separability. In Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7–10 November 2022; pp. 962–967.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Yang, X.; Feng, S.; Zhang, Y.; Wang, D. Multimodal sentiment detection based on multi-channel graph neural networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 328–339.
- Huang, L.; Ma, D.; Li, S.; Zhang, X.; Wang, H. Text level graph neural network for text classification. arXiv 2019, arXiv:1910.02356.
- Xu, N.; Zeng, Z.; Mao, W. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3777–3786.
- Li, X.; Li, Z.; Sheng, J.; Slamu, W. Low-resource text classification via cross-lingual language model fine-tuning. In Chinese Computational Linguistics: Proceedings of the 19th China National Conference, CCL 2020, Hainan, China, 30 October–1 November 2020; Springer: Cham, Switzerland, 2020; pp. 231–246.
| Hyperparameter | MVSA-Single | MVSA-Multiple | HFM |
|---|---|---|---|
| Text data augmentation | Back-translation | Back-translation | Back-translation |
| Image data augmentation | RandAugment | RandAugment | RandAugment |
| Sentiment classes | 3 | 3 | 2 |
| Loss | ConLoss/SCSupConLoss | ConLoss/SCSupConLoss | ConLoss/SCSupConLoss |
| Contrastive learning paradigm | Self-supervised/Supervised | Self-supervised/Supervised | Self-supervised/Supervised |
| Text encoder | BERT | BERT | BERT |
| Image encoder | ResNet-50 | ResNet-50 | ResNet-50 |
| Fusion model epochs | 20 | 20 | 20 |
| Total epochs | 30 | 30 | 30 |
| Fusion method | MLF/MLFC | MLF/MLFC | MLF/MLFC |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 2 × 10⁻⁵ | 2 × 10⁻⁵ | 2 × 10⁻⁵ |
| Batch size | 32/32 | 64/64 | 128/48 |
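As a reference for the augmentation settings listed above, a possible implementation sketch is shown below: `RandAugment` from torchvision for the image branch and back-translation through a pivot language for the text branch. The MarianMT checkpoints and the English↔German pivot are assumed choices; the table only states that back-translation is used.

```python
import torch
from torchvision import transforms
from transformers import MarianMTModel, MarianTokenizer

# Image branch: RandAugment as listed in the table (crop size and magnitudes are illustrative).
image_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Text branch: back-translation through a pivot language (assumed en -> de -> en).
def back_translate(texts, src="Helsinki-NLP/opus-mt-en-de",
                   tgt="Helsinki-NLP/opus-mt-de-en"):
    def translate(batch, name):
        tok = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        inputs = tok(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64)
        return tok.batch_decode(out, skip_special_tokens=True)
    return translate(translate(texts, src), tgt)

print(back_translate(["Met my hero in Cork last night."]))
```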
| Modality | Model | MVSA-Single Acc (%) | MVSA-Single F1 (%) | MVSA-Multiple Acc (%) | MVSA-Multiple F1 (%) |
|---|---|---|---|---|---|
| Text | CNN | 68.19 | 55.90 | 65.64 | 57.66 |
| Text | BiLSTM | 70.12 | 65.06 | 67.90 | 67.90 |
| Text | BERT | 71.11 | 69.70 | 67.59 | 66.24 |
| Text | TGNN | 70.34 | 65.94 | 69.67 | 61.80 |
| Image | ResNet-50 | 64.67 | 61.55 | 61.88 | 60.98 |
| Image | OSDA | 66.75 | 66.51 | 66.62 | 66.23 |
| Multimodal | MultiSentiNet | 69.84 | 69.84 | 68.86 | 68.11 |
| Multimodal | HSAN | 69.88 | 66.90 | 67.96 | 67.76 |
| Multimodal | Co-MN-Hop6 | 70.51 | 70.01 | 68.92 | 68.83 |
| Multimodal | MGNNS | 73.77 | 72.70 | – | – |
| Multimodal | CLMLF | 75.33 | 73.46 | 70.53 | 67.45 |
| Multimodal | Ours | 76.44 | 75.61 | 70.53 | 67.97 |
| Modality | Model | HFM Acc (%) | HFM F1 (%) |
|---|---|---|---|
| Text | CNN | 80.03 | 75.32 |
| Text | BiLSTM | 81.90 | 77.53 |
| Text | BERT | 83.89 | 83.26 |
| Image | ResNet-50 | 72.77 | 71.38 |
| Image | ResNet-101 | 72.48 | 71.22 |
| Multimodal | Concat (2) | 81.03 | 77.99 |
| Multimodal | Concat (3) | 81.74 | 78.74 |
| Multimodal | MMSD | 83.44 | 80.81 |
| Multimodal | D&R Net | 85.02 | 80.60 |
| Multimodal | CLMLF | 85.43 | 84.87 |
| Multimodal | Ours | 86.64 | 86.22 |
| Model | MVSA-Single Acc (%) | MVSA-Single F1 (%) | MVSA-Multiple Acc (%) | MVSA-Multiple F1 (%) | HFM Acc (%) | HFM F1 (%) |
|---|---|---|---|---|---|---|
| BERT | 71.11 | 69.70 | 67.59 | 66.24 | 83.89 | 83.26 |
| ResNet-50 | 64.67 | 61.55 | 61.88 | 60.98 | 72.77 | 71.38 |
| BERT + ResNet-50 + MLF + DBCL + LBCL | 75.33 | 73.46 | 70.53 | 67.45 | 85.43 | 84.87 |
| BERT + ResNet-50 + MLF + SCSupConLoss | 75.33 | 75.75 | 69.88 | 67.14 | 86.27 | 85.70 |
| BERT + ResNet-50 + MLFC + LBCL + SCSupConLoss | 76.44 | 75.61 | 70.53 | 67.97 | 86.64 | 86.22 |
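For reference, the Acc and F1 columns in the tables above can be computed as in the brief sketch below. Weighted F1 is assumed here (a common choice for the imbalanced MVSA datasets); the paper's exact averaging mode may differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(f"Acc: {accuracy_score(y_true, y_pred) * 100:.2f}%")
print(f"F1:  {f1_score(y_true, y_pred, average='weighted') * 100:.2f}%")
```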
| Image | Text | BERT | Ours |
|---|---|---|---|
| (image not shown) | Campaigning in Polegate this morning with the ebullient MariaCaulfield and the frugal Francis Maude | Neutral | Positive |
| (image not shown) | AVFCBlog oldmansaid JackWoodwardAV StanCollymore avfcforums worried avfc utv | Neutral | Negative |
| (image not shown) | laurencekinlan: Met my hero in Cork last night, my bleeding Ronnie looks disgraceful though? | Negative | Positive |
Share and Cite
Wang, H.; Li, X.; Ren, Z.; Wang, M.; Ma, C. Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion. Sensors 2023, 23, 2679. https://doi.org/10.3390/s23052679