Multimodal Recommendation System Based on Cross Self-Attention Fusion
Abstract
1. Introduction
- We propose a new recommendation algorithm, MR-CSAF, which leverages an adaptive modal selector and a cross-self-attention fusion mechanism to accurately model both products and users.
- We propose an adaptive modal selector that constructs a latent multimodal item relationship graph by dynamically adjusting modal weights, significantly enhancing information fusion in multimodal recommendation systems.
- We design a fusion module based on a cross-self-attention mechanism that explores intra- and inter-modal interactions, allowing more comprehensive and efficient processing and fusion of multimodal information.
- We evaluate MR-CSAF performance against several baseline approaches on three public benchmark datasets for multimodal recommendation, demonstrating superior performance compared to existing models.
2. Proposed Method
2.1. Problem Definition
2.2. Adaptive Modal Selector
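The adaptive modal selector dynamically weights each item's image and text features before constructing the latent item–item graph. As a rough, non-authoritative illustration of how such a selector could be realized, the sketch below assumes a small gating MLP that produces softmax-normalized per-item modality weights, which then blend per-modality cosine similarities into a top-k item graph; all class, function, and parameter names (including `hidden_dim`, the hyperparameter studied in Section 4.3.1) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveModalSelector(nn.Module):
    """Illustrative sketch: per-item modality weights via a gating MLP."""
    def __init__(self, visual_dim, text_dim, hidden_dim):
        super().__init__()
        # One gate per modality scores how informative that modality is per item.
        self.visual_gate = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.text_gate = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, visual_feat, text_feat):
        # visual_feat: (n_items, visual_dim), text_feat: (n_items, text_dim)
        scores = torch.cat([self.visual_gate(visual_feat),
                            self.text_gate(text_feat)], dim=-1)  # (n_items, 2)
        return F.softmax(scores, dim=-1)  # per-item weights for (image, text)

def weighted_item_graph(visual_feat, text_feat, weights, k=10):
    """Blend modality cosine similarities with per-item weights, keep top-k."""
    sim_v = F.normalize(visual_feat) @ F.normalize(visual_feat).T
    sim_t = F.normalize(text_feat) @ F.normalize(text_feat).T
    sim = weights[:, 0:1] * sim_v + weights[:, 1:2] * sim_t  # broadcast per row
    topk, idx = sim.topk(k, dim=-1)  # sparsify: k nearest neighbors per item
    return torch.zeros_like(sim).scatter_(-1, idx, topk)
```

Under this design, the per-item softmax lets the image modality dominate for visually driven products while text dominates elsewhere, which matches the adaptive behavior the paper attributes to the selector in Section 4.1.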
2.3. Cross-Self-Attention Fusion Module
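The fusion module captures both intra-modal and inter-modal interactions. A minimal sketch of one plausible realization follows, assuming standard scaled dot-product attention in which each modality attends to itself (intra-modal self-attention) and queries the other modality (inter-modal cross-attention); the module layout and shared output projection are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossSelfAttentionFusion(nn.Module):
    """Illustrative sketch: intra-modal self-attention plus inter-modal
    cross-attention, concatenated and projected back to the model dimension."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def fuse(self, x, other):
        intra, _ = self.self_attn(x, x, x)            # x attends to itself
        inter, _ = self.cross_attn(x, other, other)   # x queries the other modality
        return self.proj(torch.cat([intra, inter], dim=-1))

    def forward(self, visual_emb, text_emb):
        # Inputs: (batch, seq, dim); returns a fused embedding per modality.
        fused_v = self.fuse(visual_emb, text_emb)
        fused_t = self.fuse(text_emb, visual_emb)
        return fused_v, fused_t
```

Usage: given item-level visual and textual embeddings of a common dimension, the block returns one fused representation per modality, which can then be pooled into the final item embedding.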
2.4. Prediction Layer
2.5. Loss Function
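Recommenders in this family typically score a user–item pair by the inner product of their final embeddings and optimize the Bayesian Personalized Ranking (BPR) loss [5]. The sketch below shows only that standard objective; MR-CSAF's full objective additionally involves a feature transformation loss (see Section 4.3.2), which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb, reg=1e-4):
    """BPR [5]: an observed (positive) item should score higher than a
    sampled unobserved (negative) item for the same user."""
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1)  # inner-product scores
    neg_scores = (user_emb * neg_item_emb).sum(dim=-1)
    loss = -F.logsigmoid(pos_scores - neg_scores).mean()
    # L2 regularization over the embeddings in the batch.
    l2 = reg * (user_emb.norm(2).pow(2) + pos_item_emb.norm(2).pow(2)
                + neg_item_emb.norm(2).pow(2)) / user_emb.shape[0]
    return loss + l2
```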
3. Experiment
3.1. Research Questions
- RQ1: Can MR-CSAF outperform current mainstream multimodal recommendation algorithms on standard benchmark datasets?
- RQ2: Do the innovative components of MR-CSAF contribute to its performance?
- RQ3: How do hyperparameters affect MR-CSAF’s performance?
3.2. Datasets
3.3. Baseline Methods
- VBPR [35]: Extends Bayesian Personalized Ranking with visual features extracted from product images, combining them with implicit feedback to improve personalized ranking accuracy.
- MMGCN [25]: Constructs a modality-specific user–item graph for each modality and applies graph convolution on each, capturing modality-aware user preferences.
- GRCN [26]: Uses gating mechanisms and relational information to refine the user–item interaction graph, improving the modeling of user–item interactions and thus recommendation quality.
- SLMRec [36]: Introduces self-supervised learning tasks into graph-based multimedia recommendation to uncover latent patterns in multimodal content.
- LATTICE [30]: Learns a latent semantic item–item structure from multimodal features to capture the complex relationship between user preferences and item characteristics.
- BM3 [37]: Removes the negative sampling of user–item interactions required by most models, instead perturbing the original user and item embeddings with a latent embedding dropout mechanism.
- FREEDOM [10]: Freezes the learned item–item graph and denoises the user–item interaction graph, improving recommendation accuracy while reducing memory consumption and improving computational efficiency.
- POWERec [8]: Models modality-specific user interests through base user embeddings combined with per-modality prompts.
3.4. Experimental Setup and Evaluation Metrics
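Performance is reported as Recall@K and NDCG@K for K ∈ {10, 20}. For reference, a minimal per-user sketch of the conventional computation of these metrics under binary relevance (illustrative code, not the authors' evaluation harness):

```python
import numpy as np

def recall_at_k(ranked_items, ground_truth, k):
    """Fraction of a user's held-out items appearing in the top-k list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k):
    """Binary-relevance NDCG: log-discounted hits over the ideal ordering."""
    relevant = set(ground_truth)
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```

Dataset-level scores are then obtained by averaging these per-user values over all test users.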
4. Results and Discussion
4.1. Overall Performance (RQ1)
- (1) Our proposed method effectively captures multimodal information by modeling product information in each modality and propagating and aggregating it over the graph, yielding accurate user modal embeddings.
- (2) Our adaptive modal selector combines the distinct dimensional characteristics of different products and dynamically computes weights for the image and text modalities. This adjusts each modality's influence on a per-product basis and enables adaptive updates within the product–product graph structure, making the recommendation system more flexible and responsive to diverse product attributes.
- (3) Our cross-self-attention fusion module effectively captures both cross-modal and intra-modal interactions, enabling precise fusion and modeling of product attribute features alongside modality-specific features and producing a more accurate integrated representation of product attributes.
Datasets | Metrics | VBPR | MMGCN | GRCN | SLMRec | LATTICE | BM3 | FREEDOM | POWERec | MR-CSAF
---|---|---|---|---|---|---|---|---|---|---
Garden | NDCG@10 | 0.0547 | 0.0655 | 0.0758 | 0.0747 | 0.0849 | 0.0835 | 0.0791 | 0.0748 | 0.0948
Garden | NDCG@20 | 0.0709 | 0.0826 | 0.0945 | 0.0922 | 0.1022 | 0.1034 | 0.0961 | 0.0914 | 0.1136
Garden | Recall@10 | 0.1030 | 0.1155 | 0.1361 | 0.1345 | 0.1571 | 0.1429 | 0.1376 | 0.1262 | 0.1637
Garden | Recall@20 | 0.1651 | 0.1823 | 0.2090 | 0.2019 | 0.2242 | 0.2199 | 0.2026 | 0.1910 | 0.2379
Baby | NDCG@10 | 0.0223 | 0.0220 | 0.0282 | 0.0285 | 0.0292 | 0.0301 | 0.0330 | 0.0311 | 0.0344
Baby | NDCG@20 | 0.0284 | 0.0282 | 0.0358 | 0.0357 | 0.0370 | 0.0383 | 0.0424 | 0.0398 | 0.0444
Baby | Recall@10 | 0.0423 | 0.0421 | 0.0532 | 0.0540 | 0.0547 | 0.0564 | 0.0627 | 0.0579 | 0.0641
Baby | Recall@20 | 0.0663 | 0.0660 | 0.0824 | 0.0810 | 0.0850 | 0.0883 | 0.0992 | 0.0918 | 0.1033
Sports | NDCG@10 | 0.0307 | 0.0209 | 0.0306 | 0.0374 | 0.0335 | 0.0355 | 0.0385 | 0.0300 | 0.0394
Sports | NDCG@20 | 0.0384 | 0.0270 | 0.0389 | 0.0462 | 0.0421 | 0.0438 | 0.0481 | 0.0377 | 0.0498
Sports | Recall@10 | 0.0558 | 0.0401 | 0.0559 | 0.0676 | 0.0620 | 0.0656 | 0.0617 | 0.0565 | 0.0734
Sports | Recall@20 | 0.0856 | 0.0636 | 0.0877 | 0.1017 | 0.0953 | 0.0980 | 0.1089 | 0.0863 | 0.1130
4.2. Ablation Studies (RQ2)
- Adaptive modal selector (AMS): a variant that keeps the adaptive modal selector but does not use the cross-self-attention fusion mechanism to manage embeddings.
- Cross-self-attention fusion (CSAF): a variant that uses only the cross-self-attention fusion mechanism to learn inter- and intra-modal interactions, without the adaptive modal selector.
Datasets | Modules | NDCG@20 | Recall@20
---|---|---|---
Garden | AMS | 0.1128 | 0.2378
Garden | CSAF | 0.1077 | 0.2256
Garden | MR-CSAF | 0.1136 | 0.2379
Baby | AMS | 0.0433 | 0.1014
Baby | CSAF | 0.0435 | 0.1012
Baby | MR-CSAF | 0.0444 | 0.1033
Sports | AMS | 0.0496 | 0.1123
Sports | CSAF | 0.0486 | 0.1106
Sports | MR-CSAF | 0.0498 | 0.1130
4.3. Hyperparameter Study (RQ3)
4.3.1. Impact of Hidden_dim Size in Adaptive Modal Selectors
4.3.2. Effect of Feature Transformation Loss
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Ding, Y.; Ma, Y.; Wong, W.K.; Chua, T.S. Modeling instant user intent and content-level transition for sequential fashion recommendation. IEEE Trans. Multimed. 2022, 24, 142–149.
- Wu, L.; Chen, L.; Hong, R.; Fu, Y.; Xie, X.; Wang, M. A hierarchical attention model for social contextual image recommendation. IEEE Trans. Knowl. Data Eng. 2019, 32, 1854–1867.
- Yan, C.; Liu, L. Recommendation Method Based on Heterogeneous Information Network and Multiple Trust Relationship. Systems 2023, 11, 169.
- Ma, T.; Huang, L.; Lu, Q.; Hu, S. KR-GCN: Knowledge-aware reasoning with graph convolution network for explainable recommendation. ACM Trans. Inf. Syst. 2023, 41, 1–27.
- Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618.
- He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the SIGIR, Virtual, 25–30 July 2020; pp. 639–648.
- Zhou, X.; Lin, D.; Liu, Y.; Miao, C. Layer-refined graph convolutional networks for recommendation. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 1247–1259.
- Dong, X.; Song, X.; Tian, M.; Hu, L. Prompt-based and weak-modality enhanced multimodal recommendation. Inf. Fusion 2024, 101, 101989.
- Molaie, M.M.; Lee, W. Economic corollaries of personalized recommendations. J. Retail. Consum. Serv. 2022, 68, 103003.
- Zhou, X.; Shen, Z. A tale of two graphs: Freezing and denoising graph structures for multimodal recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 935–943.
- Zhou, H.; Zhou, X.; Zeng, Z.; Zhang, L.; Shen, Z. A comprehensive survey on multimodal recommender systems: Taxonomy, evaluation, and future directions. arXiv 2023, arXiv:2302.04473.
- Liu, Q.; Hu, J.; Xiao, Y.; Zhao, X.; Gao, J.; Wang, W.; Li, Q.; Tang, J. Multimodal recommender systems: A survey. ACM Comput. Surv. 2024, 57, 1–17.
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444.
- Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs. late fusion in multimodal convolutional neural networks. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020; pp. 1–6.
- Wang, Y.; Xu, X.; Yu, W.; Xu, R.; Cao, Z.; Shen, H.T. Combine early and late fusion together: A hybrid fusion framework for image-text matching. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Li, K.; Xu, L.; Zhu, C.; Zhang, K. A Multimodal Graph Recommendation Method Based on Cross-Attention Fusion. Mathematics 2024, 12, 2353.
- Wang, R.; Wu, Z.; Lou, J.; Jiang, Y. Attention-based dynamic user modeling and deep collaborative filtering recommendation. Expert Syst. Appl. 2022, 188, 116036.
- Tao, Z.; Wei, Y.; Wang, X.; He, X.; Huang, X.; Chua, T.S. MGAT: Multimodal graph attention network for recommendation. Inf. Process. Manag. 2020, 57, 102277.
- Hu, Z.; Cai, S.M.; Wang, J.; Zhou, T. Collaborative recommendation model based on multi-modal multi-view attention network: Movie and literature cases. Appl. Soft Comput. 2023, 144, 110518.
- He, Q.; Liu, S.; Liu, Y. Optimal Recommendation Models Based on Knowledge Representation Learning and Graph Attention Networks. IEEE Access 2023, 11, 19809–19818.
- Liu, F.; Chen, H.; Cheng, Z.; Liu, A.; Nie, L.; Kankanhalli, M. Disentangled multimodal representation learning for recommendation. IEEE Trans. Multimed. 2022, 25, 7149–7159.
- Liu, F.; Cheng, Z.; Sun, C.; Wang, Y.; Nie, L.; Kankanhalli, M. User diverse preference modeling by multimodal attentive metric learning. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1526–1534.
- Wu, C.; Wu, F.; Qi, T.; Zhang, C.; Huang, Y.; Xu, T. MM-Rec: Visiolinguistic model empowered multimodal news recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2560–2564.
- Xun, J.; Zhang, S.; Zhao, Z.; Zhu, J.; Zhang, Q.; Li, J.; He, X.; He, X.; Chua, T.S.; Wu, F. Why do we click: Visual impression-aware news recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 3881–3890.
- Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445.
- Wei, Y.; Wang, X.; Nie, L.; He, X.; Chua, T.S. Graph-refined convolutional network for multimedia recommendation with implicit feedback. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3541–3549.
- Wang, Q.; Wei, Y.; Yin, J.; Wu, J.; Song, X.; Nie, L. DualGNN: Dual graph neural network for multimedia recommendation. IEEE Trans. Multimed. 2021, 25, 1074–1084.
- Chen, F.; Wang, J.; Wei, Y.; Zheng, H.T.; Shao, J. Breaking isolation: Multimodal graph fusion for multimedia recommendation by edge-wise modulation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 385–394.
- Mu, Z.; Zhuang, Y.; Tan, J.; Xiao, J.; Tang, S. Learning hybrid behavior patterns for multimedia recommendation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 376–384.
- Zhang, J.; Zhu, Y.; Liu, Q.; Wu, S.; Wang, S.; Wang, L. Mining latent structures for multimedia recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 3872–3880.
- Lei, F.; Cao, Z.; Yang, Y.; Ding, Y.; Zhang, C. Learning the user’s deeper preferences for multi-modal recommendation systems. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–18.
- Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
- Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80.
- He, R.; McAuley, J.J. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the AAAI, Phoenix, AZ, USA, 12–17 February 2016; pp. 144–150.
- Tao, Z.; Liu, X.; Xia, Y.; Wang, X.; Yang, L.; Huang, X.; Chua, T. Self-Supervised Learning for Multimedia Recommendation. IEEE Trans. Multimed. 2023, 25, 5107–5116.
- Zhou, X.; Zhou, H.; Liu, Y.; Zeng, Z.; Miao, C.; Wang, P.; You, Y.; Jiang, F. Bootstrap Latent Representations for Multi-modal Recommendation. In Proceedings of the WWW, Melbourne, Australia, 14–20 May 2023; pp. 845–854.
Dataset | # Users | # Items | # Interactions | Sparsity
---|---|---|---|---
Garden | 1686 | 961 | 13,274 | 99.18%
Baby | 19,445 | 7050 | 160,792 | 99.88%
Sports | 35,598 | 18,357 | 296,337 | 99.95%
Dataset | Introduction | Visual Information | Text Information
---|---|---|---
Garden | Waste materials, pesticides and tools, seeds, etc. | (product image) | title: Victor M231 Ultimate Flea Trap Refills, 3 Per Pack. description: These flea trap refills have a sweet odor inserted in the specially formulated sticky glue disc. Fleas do not stand a chance. Our flea traps let pet owners see the results. The non-poisonous and odorless trap refills enable safe placement of the flea trap around children and pets.
Baby | Mother and baby products, such as bottles, strollers, etc. | (product image) | title: Nuby 2 Pack Soft Sipper Replacement Spout, Clear. description: 2 pack Soft Sipper Replacement Spouts fits standard neck bottles. Made from soft durable silicone. Non-drip design helps prevent leaks and spills.
Sports | Sports products such as shoes, shirts, etc. | (product image) | title: Wenzel Multi Purpose Ground Mat. description: The multi-purpose ground mat is constructed of weatherproof and water resistant woven polypropylene material. Folds down to carry size with built-in handles and pockets. Be ready for sand, ground and grass protection with this handy multi purpose personal gear mat. feature: 'Easy to use', 'High quality product', 'Manufactured in China'.
Datasets | Modules | NDCG@20 | Recall@20
---|---|---|---
Baby | CAmgr-CAM | 0.0386 | 0.0847
Baby | CSAF | 0.0435 | 0.1012
Sports | CAmgr-CAM | 0.0341 | 0.0905
Sports | CSAF | 0.0486 | 0.1106