A Parallel Multi-Modal Factorized Bilinear Pooling Fusion Method Based on the Semi-Tensor Product for Emotion Recognition
Abstract
1. Introduction
- (1) A multi-modal factorized bilinear pooling method based on STP, which avoids the data redundancy caused by dimension matching and reduces computational and memory costs.
- (2) A parallel multi-modal factorized bilinear pooling method based on STP, which captures the rich interactions between features through hierarchical fusion and supports the arbitrary combination and fusion of three modalities.
- (3) An experimental evaluation of the proposed method on two multi-modal datasets.
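As background for contributions (1) and (2): the semi-tensor product (STP) generalizes the ordinary matrix product to factors whose inner dimensions need not match, which is what lets the fusion step skip explicit dimension matching. A minimal NumPy sketch of the standard left STP definition (the function name and example shapes are illustrative, not taken from the paper):

```python
import numpy as np
from math import lcm

def stp(A, B):
    """Left semi-tensor product A <| B.

    For A of shape (m, n) and B of shape (p, q), let t = lcm(n, p).
    Then A <| B = (A kron I_{t/n}) @ (B kron I_{t/p}), which reduces
    to the ordinary matrix product when n == p.
    """
    m, n = A.shape
    p, q = B.shape
    t = lcm(n, p)
    # Inflate each factor with an identity so the inner dimensions agree.
    return np.kron(A, np.eye(t // n)) @ np.kron(B, np.eye(t // p))
```

When the inner dimensions already agree (`n == p`), `stp` coincides with `A @ B`; otherwise it still produces a well-defined result of shape `(m*t/n, q*t/p)`, which is the property the fusion method exploits.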
2. Notation and Preliminaries
3. Methodology
3.1. Model Architecture
3.2. Multi-Modal, Factorized Bilinear Pooling
3.3. Multi-Modal, Factorized Bilinear Pooling Based on STP
3.4. Parallel, Multi-Modal, Factorized Bilinear Pooling Based on STP (PFBP-STP)
| Algorithm 1 PFBP-STP |
| Input: vectors …; Output: vector … |
4. Experiments
- (1) Comparison with the state of the art: we conducted experiments with PFBP-STP and state-of-the-art methods on an emotion-recognition task using the IEMOCAP and CMU-MOSI datasets;
- (2) The advantage of PFBP-STP: by replacing matrix products with semi-tensor products, it allows information fusion that is independent of the dimension-matching conditions of matrix multiplication;
- (3) Complexity analysis: we evaluated the speed and the number of learned parameters of the method against those of other methods.
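For reference, the standard multi-modal factorized bilinear pooling (MFB) that PFBP-STP builds on fuses two feature vectors by projecting each into a shared factor space, taking an element-wise product, and sum-pooling over groups of k factors, followed by the usual power and L2 normalizations. A hedged NumPy sketch of that published MFB formulation (this is the baseline operation, not the paper's STP variant; all names and shapes are illustrative):

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """Standard MFB fusion of two feature vectors x and y.

    U: (dim_x, k*o) and V: (dim_y, k*o) are learned projections; the
    output has o dimensions, each the sum over a window of k factor
    products (sum pooling).
    """
    joint = (U.T @ x) * (V.T @ y)            # element-wise product, (k*o,)
    z = joint.reshape(-1, k).sum(axis=1)     # sum-pool every k entries -> (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))      # signed square-root normalization
    return z / (np.linalg.norm(z) + 1e-12)   # L2 normalization
```

In MFB, `U` and `V` must exactly match the input feature dimensions; the STP-based variant described in point (2) relaxes this matching requirement.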
4.1. Datasets
4.2. Multi-Modal Data Features
4.3. Baseline
4.4. Evaluation Metrics
4.5. Training Setup
5. Results and Discussion
5.1. Comparison with the State-of-the-Art
5.2. Ablation Experiment
5.3. Evaluation Indicators
5.4. Computational Complexity
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| TFN | Tensor Fusion Network |
| EF-LSTM | Early Fusion LSTM |
| LF-LSTM | Late Fusion LSTM |
| MFB | Multi-Modal Factorized Bilinear Pooling |
| MuLT | Multi-modal Transformer |
| LMF-MulT | Low-Rank Fusion-based Transformer for Multi-modal Sequences |
| MFN | Memory Fusion Network |
| LMF | Low-rank Multi-modal Fusion |
| ACC | Accuracy |
| MAE | Mean Absolute Error |



| Dataset | IEMOCAP | CMU-MOSI |
|---|---|---|
| Training | 6373 | 1284 |
| Validation | 1775 | 229 |
| Test | 1807 | 686 |
| Method | Happy F1 | Happy Acc | Sad F1 | Sad Acc | Angry F1 | Angry Acc | Neutral F1 | Neutral Acc |
|---|---|---|---|---|---|---|---|---|
| EF-LSTM | 75.7 | 76.2 | 70.5 | 70.2 | 67.1 | 72.7 | 57.4 | 58.1 |
| LF-LSTM | 71.8 | 72.5 | 70.4 | 72.9 | 67.9 | 68.6 | 56.2 | 59.6 |
| MulT | 79.0 | 85.6 | 70.3 | 79.4 | 65.4 | 75.8 | 44.0 | 59.2 |
| LMF-MulT | 79.0 | 85.6 | 70.3 | 79.4 | 65.4 | 75.8 | 44.0 | 59.3 |
| OURS | 79.0 | 85.7 | 70.3 | 79.5 | 65.4 | 75.9 | 43.8 | 59.2 |
| Method | MAE | Corr | Acc-2 | F1 | Acc-7 |
|---|---|---|---|---|---|
| EF-LSTM | 1.078 | 0.542 | 73.6 | 74.5 | 31.0 |
| LF-LSTM | 0.988 | 0.624 | 77.6 | 77.8 | 33.7 |
| MulT | 1.008 | 0.645 | 80.3 | 80.4 | 34.3 |
| LMF-MulT | 0.957 | 0.681 | 78.5 | 78.5 | 34.0 |
| OURS | 1.038 | 0.683 | 71.7 | 78.9 | 34.5 |
| Method | Happy F1 | Happy Acc | Sad F1 | Sad Acc | Angry F1 | Angry Acc | Neutral F1 | Neutral Acc |
|---|---|---|---|---|---|---|---|---|
| PFBP-STP (a+v) | 77.9 | 85.2 | 69.2 | 79.3 | 65.4 | 75.8 | 42.5 | 57.9 |
| PFBP-STP (a+t) | 77.8 | 85.6 | 69.4 | 79.2 | 64.3 | 75.7 | 41.2 | 58.2 |
| PFBP-STP (t+v) | 78.5 | 85.7 | 70.0 | 79.5 | 65.1 | 75.9 | 43.7 | 58.8 |
| PFBP-STP (a+v+t) | 79.0 | 85.7 | 70.3 | 79.5 | 65.4 | 75.9 | 43.8 | 59.2 |
| Method | IEMOCAP | CMU-MOSI |
|---|---|---|
| MulT | 1074998 | 1071211 |
| LMF-MulT | 856078 | 512121 |
| OURS | 559872 | 500342 |
| Method | IEMOCAP | CMU-MOSI |
|---|---|---|
| MulT | 37.93 | 19.25 |
| LMF-MulT | 23.53 | 12.03 |
| OURS | 17.92 | 11.92 |
Share and Cite
Liu, F.; Chen, J.; Li, K.; Tan, W.; Cai, C.; Ayub, M.S. A Parallel Multi-Modal Factorized Bilinear Pooling Fusion Method Based on the Semi-Tensor Product for Emotion Recognition. Entropy 2022, 24, 1836. https://doi.org/10.3390/e24121836

