A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism
Abstract
1. Introduction
2. Related Works
3. Model Establishment
3.1. Overview of the Methodology
3.2. Video Feature Extraction
3.3. Audio Feature Extraction
3.4. Text Feature Extraction
3.5. Joint Attention Module
3.6. Chained Interactive Attention Module
3.7. Loss Function
4. Experiments
4.1. Experimental Environment Settings
4.2. Evaluation Indicators
4.3. Parameter Settings
4.4. Comparison Models
4.5. Results
4.5.1. Polarity Analysis Results
4.5.2. Sentiment Recognition Results
4.6. Analysis of Ablation Experiments
4.6.1. VAE-JCIA Methodological Validity
4.6.2. Modal Fusion Sequential Validity
4.6.3. Balancing Factor Adjustments
4.6.4. Noise Robustness Testing
5. Conclusions
- (1) Although information fusion between different modalities is enhanced by the joint chained interactive attention mechanism, the depth and efficiency of this fusion may be limited by the information contained in the initial representations of the modal features. Completely and unambiguously parsing the emotions expressed by speakers in video and audio is still not achievable, and reaching deeper information fusion while maintaining computational efficiency remains a challenge for the model.
- (2) Although the approach was validated on the CMU-MOSEI and IEMOCAP datasets, these datasets cannot cover all multimodal variations of emotional expression encountered in real-world scenarios. The generalization ability of the model and its applicability in different application contexts therefore need to be examined in future research.
- (3) In real scenarios, information from different modalities may be inconsistent or in direct conflict. Whether current models can handle such cases effectively, and how best to resolve such modal inconsistencies, remain open questions for future work.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liang, B.; Su, H.; Gui, L.; Cambria, E.; Xu, R. Aspect-based sentiment analysis via affective knowledge enhanced graph convolutional networks. Knowl. Based Syst. 2022, 235, 107643. [Google Scholar] [CrossRef]
- Zhu, Y.; Dong, J.; Xie, L.; Wang, Z.; Qin, S.; Xu, P.; Yin, M. Recurrent Multi-View Collaborative Registration Network for 3D Reconstruction and Optical Measurement of Blade Profiles. Knowl. Based Syst. 2024, 295, 111857. [Google Scholar] [CrossRef]
- Chen, L.; Guan, Z.Y.; He, J.; Peng, J. A survey on sentiment classification. J. Comput. Res. Dev. 2017, 54, 1150–1170. [Google Scholar]
- Zhou, J.; Zhao, J.; Huang, J.X.; Hu, Q.V.; He, L. MASAD: A large-scale dataset for multimodal aspect-based sentiment analysis. Neurocomputing 2021, 455, 47–58. [Google Scholar] [CrossRef]
- Zhu, L.; Zhu, Z.; Zhang, C.; Xu, Y.; Kong, X. Multimodal sentiment analysis based on fusion methods: A survey. Inf. Fusion 2023, 95, 306–325. [Google Scholar] [CrossRef]
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
- Fu, Z.; Liu, F.; Xu, Q.; Qi, J.; Fu, X.; Zhou, A.; Li, Z. NHFNET: A non-homogeneous fusion network for multimodal sentiment analysis. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
- Cao, D.; Ji, R.; Lin, D.; Li, S. A cross-media public sentiment analysis system for microblog. Multimed. Syst. 2016, 22, 479–486. [Google Scholar] [CrossRef]
- Cao, D.; Ji, R.; Lin, D.; Li, S. Visual sentiment topic model based microblog image sentiment analysis. Multimed. Tools Appl. 2016, 75, 8955–8968. [Google Scholar] [CrossRef]
- Zhang, Y.; Song, D.; Li, X.; Zhang, P.; Wang, P.; Rong, L.; Yu, G.; Wang, B. A quantum-like multimodal network framework for modeling interaction dynamics in multiparty conversational sentiment analysis. Inf. Fusion 2020, 62, 14–31. [Google Scholar] [CrossRef]
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. Proc. Conf. Assoc. Comput. Linguist. Meet. 2019, 2019, 6558. [Google Scholar]
- Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
- Xu, N.; Mao, W.; Chen, G. Multi-interactive memory network for aspect based multimodal sentiment analysis. Proc. Aaai Conf. Artif. Intell. 2019, 33, 371–378. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Bao, G.; Li, G.; Wang, G. Bimodal Interactive Attention for Multimodal Sentiment Analysis. J. Front. Comput. Sci. Technol. 2022, 16, 909. [Google Scholar]
- Hu, H.; Ding, Z.; Zhang, Y.; Liu, M. Images-Text Sentiment Analysis in Social Media Based on Joint and Interactive Attention. J. Beijing Univ. Aeronaut. Astronaut. 2023. (In Chinese) [Google Scholar]
- Fan, T.; Wu, P.; Wang, H.; Ling, C. Sentiment Analysis of Online Users Based on Multimodal Co-attention. J. China Soc. Sci. Tech. Inf. 2021, 40, 656–665. (In Chinese) [Google Scholar]
- Dai, Z.; Lai, G.; Yang, Y.; Le, Q. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. Adv. Neural Inf. Process. Syst. 2020, 33, 4271–4282. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Shen, Y.; Mariconti, E.; Vervier, P.-A.; Stringhini, G. Tiresias: Predicting security events through deep learning. arXiv 2019, arXiv:1905.10328. [Google Scholar]
- Shahid, F.; Zameer, A.; Muneeb, M. A novel genetic LSTM model for wind power forecast. Energy 2021, 223, 120069. [Google Scholar] [CrossRef]
- Fang, X.; Xu, M.; Xu, S.; Zhao, P. A deep learning framework for predicting cyber attacks rates. EURASIP J. Inf. Secur. 2019, 2019, 5. [Google Scholar] [CrossRef]
- Yao, Z.; Zhang, T.; Wang, Q.; Zhao, Y. Short-term power load forecasting of integrated energy system based on attention-CNN-DBILSTM. Math. Probl. Eng. 2022, 2022, 1075698. [Google Scholar] [CrossRef]
- Bengio, Y.; Ducharme, R.; Vincent, P. A neural probabilistic language model. Adv. Neural Inf. Process. Syst. 2000, 13. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Zadeh, A.A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.-P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Melbourne, VIC, Australia, 2018; pp. 2236–2246. [Google Scholar]
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Akhtar, M.S.; Ekbal, A.; Cambria, E. How intense are you? Predicting intensities of emotions and sentiments using stacked ensemble [application notes]. IEEE Comput. Intell. Mag. 2020, 15, 64–75. [Google Scholar] [CrossRef]
- Krommyda, M.; Rigos, A.; Bouklas, K.; Amditis, A. An experimental analysis of data annotation methodologies for emotion detection in short text posted on social media. Informatics 2021, 8, 19. [Google Scholar] [CrossRef]
- Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.-P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.-P. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.-P.; Póczos, B. Found in translation: Learning robust joint representations by cyclic translations between modalities. Proc. Aaai Conf. Artif. Intell. 2019, 33, 6892–6899. [Google Scholar] [CrossRef]
- Liang, P.P.; Liu, Z.; Zadeh, A.; Morency, L.-P. Multimodal language analysis with recurrent multistage fusion. arXiv 2018, arXiv:1808.03920. [Google Scholar]
- Wang, H. Sentiment Analysis Based on Multimodal Feature Fusion. Master’s Thesis, Nanjing University of Posts and Telecommunications, Nanjing, China, 2023. [Google Scholar]
- Verma, S.; Wang, J.; Ge, Z.; Shen, R.; Jin, F.; Wang, Y.; Chen, F.; Liu, W. Deep-HOSeq: Deep higher order sequence fusion for multimodal sentiment analysis. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; IEEE: New York, NY, USA, 2020; pp. 561–570. [Google Scholar]
- Sun, H.; Wang, H.; Liu, J.; Chen, Y.-W.; Lin, L. CubeMLP: An MLP-based model for multimodal sentiment analysis and depression estimation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3722–3729. [Google Scholar]
- Shi, P.; Hu, M.; Shi, X.; Ren, F. Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 109. [Google Scholar] [CrossRef]
- Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. Proc. Aaai Conf. Artif. Intell. 2021, 35, 10790–10797. [Google Scholar] [CrossRef]
- Yoon, S.; Byun, S.; Jung, K. Multimodal speech emotion recognition using audio and text. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; IEEE: New York, NY, USA, 2018; pp. 112–118. [Google Scholar]
- Hu, D.; Hou, X.; Wei, L.; Jiang, L.; Mo, Y. MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7037–7041. [Google Scholar]
- Wen, J.; Jiang, D.; Tu, G.; Liu, C.; Cambria, E. Dynamic interactive multiview memory network for emotion recognition in conversation. Inf. Fusion 2023, 91, 123–133. [Google Scholar] [CrossRef]
Authors | Contribution | Methodology |
---|---|---|
Cao et al. [8,9] | Visual sentiment topic model using SVMs. | Decision-level fusion of text and image features. |
Zhang et al. [10] | Multimodal emotion recognition network. | Quantum-like network and interaction dynamics modeling. |
Zadeh et al. [11] | Tensor Fusion Network for dynamics. | Handles intra and inter-channel dynamics. |
Liu et al. [12] | Low rank tensor fusion method. | Reduces complexity in tensor fusion. |
Tsai et al. [13] | Multimodal Transformer for data alignment. | End-to-end explicit alignment approach. |
Hazarika et al. [14] | Subspace projection for modality contexts. | Uses multi-headed attention for context learning. |
Xu et al. [15] | Interactive memory network with attention. | Fuses textual and visual information interactively. |
Devlin et al. [16] | Attention mechanism for aspectual words. | Local feature mining and fusion in modalities. |
Bao et al. [17] | Interactive attention for modality correlations. | Fuses features without enhancing correlations. |
Hu et al. [18] | Interaction attention for sentiment analysis. | Applied to graphic-textual modalities, with limits. |
Experimental Environment | Environmental Configuration |
---|---|
System | Linux |
GPU | RTX 4090 |
CPU | 16 vCPU Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz |
PyTorch | 1.11.0 |
Python | 3.8 |
CUDA | 11.3 |
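As a reproducibility aid (not part of the original article), the short Python check below prints the runtime versions so they can be compared against the configuration listed above; on the reported setup the expected values are Python 3.8, PyTorch 1.11.0, and CUDA 11.3 with an RTX 4090.

```python
import sys
import torch

# Sanity check that the runtime roughly matches the reported environment
# (Python 3.8, PyTorch 1.11.0, CUDA 11.3, a single RTX 4090).
print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU    :", torch.cuda.get_device_name(0))
```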
Tasks | Data Set | Emotions | Train + Val | Test |
---|---|---|---|---|
Emotional polarity judgment | CMU-MOSEI | −3, −2, −1, 0, 1, 2, 3 | 17,830 | 4759 |
Multimodal sentiment analysis | IEMOCAP | Happy, Frustrated, Angry, Neutral, Sad | 5568 | 1623 |
Hyperparameter | Value |
---|---|
Optimizer | Adam |
Audio Shape | 74 |
Video Shape | 35 |
Text Shape | 768 |
Output Dimension | 1 |
Hidden Audio Size | 64 |
Hidden Video Size | 64 |
Hidden Text Size | 64 |
Hidden Size | 32 |
Learning Rate | 0.00005 |
Batch Size | 32 |
Number of Epochs | 200 |
Early Stopping Patience | 10 |
Dropout Input | 0.2 |
Attention Heads | 5 |
Gradient Clipping | 4.0 |
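To make the table above concrete, the sketch below wires the listed hyperparameters into a minimal PyTorch training step. The stand-in model (dropout plus a linear head over concatenated unimodal features) and the dummy batch are illustrative placeholders, not the authors' VAE-JCIA implementation; only the numerical values are taken from the table.

```python
import torch
import torch.nn as nn

# Hyperparameter values copied from the table above.
cfg = {
    "audio_dim": 74, "video_dim": 35, "text_dim": 768,
    "output_dim": 1, "dropout": 0.2,
    "lr": 5e-5, "batch_size": 32, "grad_clip": 4.0,
}

# Stand-in model used only so the training-step skeleton is runnable:
# dropout on the concatenated unimodal features, then a linear head.
in_dim = cfg["audio_dim"] + cfg["video_dim"] + cfg["text_dim"]
model = nn.Sequential(nn.Dropout(cfg["dropout"]), nn.Linear(in_dim, cfg["output_dim"]))
optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])

# One dummy batch with the feature dimensions and batch size from the table.
x = torch.randn(cfg["batch_size"], in_dim)
y = torch.randn(cfg["batch_size"], cfg["output_dim"])

pred = model(x)
loss = nn.functional.l1_loss(pred, y)                 # MAE-style regression loss
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), cfg["grad_clip"])  # clip at max-norm 4.0
optimizer.step()
```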
Round | Acc-2/% | F1 (2-Class) | Acc-7/% | F1 (7-Class) |
---|---|---|---|---|
1 | 85.92 | 0.8894 | 53.09 | 0.5436 |
2 | 85.81 | 0.8887 | 53.17 | 0.5460 |
3 | 85.89 | 0.8891 | 52.92 | 0.5427 |
4 | 85.78 | 0.8882 | 52.92 | 0.5421 |
5 | 85.87 | 0.8890 | 53.09 | 0.5450 |
6 | 85.78 | 0.8876 | 52.85 | 0.5422 |
7 | 85.81 | 0.8886 | 52.94 | 0.5436 |
8 | 85.91 | 0.8822 | 53.07 | 0.5433 |
9 | 85.99 | 0.8897 | 53.17 | 0.5462 |
10 | 85.99 | 0.8897 | 53.11 | 0.5433 |
Average | 85.88 | 0.8889 | 53.03 | 0.5438 |
Round | WA% | UA% |
---|---|---|
1 | 71.94 | 70.31 |
2 | 70.11 | 69.77 |
3 | 71.55 | 70.22 |
4 | 71.89 | 70.28 |
5 | 71.01 | 70.12 |
6 | 71.54 | 70.28 |
7 | 70.55 | 69.93 |
8 | 71.89 | 70.38 |
9 | 71.74 | 70.15 |
10 | 71.34 | 70.73 |
Average | 71.36 | 70.22 |
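WA and UA are read here in their usual speech emotion recognition sense (an assumption, since the definitions are standard rather than specific to this paper): WA is the overall accuracy over all test samples, and UA is the unweighted mean of the per-class recalls. A minimal NumPy sketch of both metrics:

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA: overall accuracy; UA: unweighted mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = np.mean(recalls)
    return wa, ua

# Toy example with 3 emotion classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
wa, ua = weighted_unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")   # WA = 0.8333, UA = 0.8889
```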
Model | MAE | Corr | Acc-2/% | Acc-7/% | F1 |
---|---|---|---|---|---|
Graph-MFN | 0.623 | 0.677 | 81.58 | 45.00 | 0.827 |
TFN | 0.593 | 0.700 | 80.80 | - | 0.825 |
MARN | 0.587 | 0.627 | 79.80 | 34.70 | 0.836 |
RMFN | 0.565 | 0.679 | 82.10 | 38.30 | 0.814 |
MCTN | 0.551 | 0.667 | 83.66 | - | 0.823 |
Deep-HOSeq | 0.551 | 0.688 | 84.22 | 44.17 | 0.846 |
LSTF-Dfusion | 0.546 | 0.691 | 85.73 | 46.35 | 0.851 |
CubeMLP | 0.529 | 0.760 | 85.10 | - | 0.845 |
MISA | 0.568 | 0.724 | 84.20 | - | 0.840 |
Self-MM | 0.724 | 0.762 | 85.20 | 52.90 | 0.851 |
MulT | 0.580 | 0.703 | 82.50 | - | 0.823
Ours | 0.542 | 0.774 | 85.88 | 53.03 | 0.889 |
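For reference, the CMU-MOSEI polarity metrics above are conventionally derived from a continuous prediction on the [−3, 3] scale: MAE and Pearson correlation on the raw scores, binary accuracy and F1 after thresholding at zero, and seven-class accuracy after rounding and clipping to the integer scale. The sketch below follows that widely used protocol; it is an assumed reconstruction, not the authors' exact evaluation script (binary-accuracy variants differ in how they treat exactly-zero labels).

```python
import numpy as np

def mosei_metrics(y_true, y_pred):
    """Standard CMU-MOSEI regression-style metrics (assumed protocol)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    corr = np.corrcoef(y_true, y_pred)[0, 1]              # Pearson correlation
    acc7 = np.mean(np.clip(np.round(y_pred), -3, 3) ==
                   np.clip(np.round(y_true), -3, 3))      # 7-class accuracy
    t, p = y_true >= 0, y_pred >= 0                       # binary polarity
    acc2 = np.mean(t == p)
    tp = np.sum(t & p)
    f1 = 2 * tp / (2 * tp + np.sum(~t & p) + np.sum(t & ~p))  # binary F1
    return {"MAE": mae, "Corr": corr, "Acc2": acc2, "Acc7": acc7, "F1": f1}

# Toy example on a handful of sentiment scores.
print(mosei_metrics(y_true=[-2.0, -0.5, 0.0, 1.5, 3.0],
                    y_pred=[-1.5, -1.0, 0.5, 1.0, 2.5]))
```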
Model | WA/% | UA/% |
---|---|---|
MM-DFN | 68.21 | - |
GBAN | 71.39 | 70.08 |
MDRE | 71.80 | 71.40 |
DIMMN | 64.70 | - |
FG-CME | 71.01 | 71.66 |
Ours | 71.26 | 70.15 |
Model | Accuracy-5 (IEMOCAP) | Recall-5 (IEMOCAP) | F1 (IEMOCAP) | Accuracy-2 (CMU-MOSEI) | Recall-2 (CMU-MOSEI) | F1 (CMU-MOSEI) |
---|---|---|---|---|---|---|
VAE-CIA | 0.670 | 0.724 | 0.6968 | 0.8180 | 0.8891 | 0.8517 |
VAE-JA | 0.649 | 0.614 | 0.6302 | 0.7920 | 0.8837 | 0.8351 |
VAE-JCIA | 0.722 | 0.778 | 0.7492 | 0.8510 | 0.9096 | 0.8846 |
Modal Fusion Order | MAE | Corr | Acc-2/% | F1 | Acc-7/% |
---|---|---|---|---|---|
A→E→V→A | 0.711 | 0.603 | 83.71 | 0.8491 | 51.09 |
A→V→E→A | 0.728 | 0.609 | 83.64 | 0.8352 | 49.79 |
V→A→E→V | 0.886 | 0.646 | 83.18 | 0.8291 | 51.34 |
V→E→A→V | 0.643 | 0.678 | 83.04 | 0.8281 | 52.45 |
E→V→A→E | 0.545 | 0.775 | 84.82 | 0.8757 | 53.59 |
E→A→V→E | 0.542 | 0.774 | 85.88 | 0.8889 | 53.03 |
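To illustrate what a fusion order such as E→A→V→E can mean operationally, the sketch below chains standard multi-head cross-attention blocks, using the running representation as the query and the next modality in the chain as key/value. This is a simplified, assumed reading of the chained interactive attention idea, not the exact JCIA module; the class name ChainedCrossAttention and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ChainedCrossAttention(nn.Module):
    """Illustrative chained cross-attention over an ordered list of modalities.

    Each step uses the running representation as query and the next modality
    in the chain as key/value (an assumed reading of E→A→V→E); this is a
    sketch, not the paper's exact JCIA module.
    """

    def __init__(self, dim: int = 64, heads: int = 4, steps: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(steps)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(steps)])

    def forward(self, chain):
        # chain = [E, A, V, E] as (batch, seq_len, dim) tensors; the first
        # element seeds the query, the rest are attended to in order.
        x = chain[0]
        for attn, norm, kv in zip(self.blocks, self.norms, chain[1:]):
            out, _ = attn(query=x, key=kv, value=kv)
            x = norm(x + out)                      # residual + LayerNorm
        return x

# Toy usage: projected unimodal sequences with a shared hidden size of 64.
B, L, D = 2, 20, 64
E, A, V = (torch.randn(B, L, D) for _ in range(3))
fused = ChainedCrossAttention(dim=D, heads=4, steps=3)(chain=[E, A, V, E])
print(fused.shape)   # torch.Size([2, 20, 64])
```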
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).