Fine-Grained Local and Global Semantic Fusion for Multimodal Image–Text Retrieval
Abstract
1. Introduction
2. Related Work
2.1. Global Alignment
2.2. Local Alignment
3. The Proposed Method
3.1. Feature Extraction
3.2. Region Relationship Reasoning
3.3. Semantic Relationship Enhancement
3.4. Loss Function
4. Experimental Results and Analysis
4.1. Datasets and Evaluation Metrics
4.2. Comparative Experiments
4.3. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Pan, Z.; Wu, F.; Zhang, B. Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 19275–19284. [Google Scholar]
- Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Visual Semantic Reasoning for Image-Text Matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Fu, Z.; Mao, Z.; Song, Y.; Zhang, Y. Learning Semantic Relationship Among Instances for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 15159–15168. [Google Scholar]
- Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Albalawi, B.M.; Jamal, A.T.; Al Khuzayem, L.A.; Alsaedi, O.A. An End-to-End Scene Text Recognition for Bilingual Text. Big Data Cogn. Comput. 2024, 8, 117. [Google Scholar] [CrossRef]
- Ni, Z.; Ren, P. Incorporating object counts into remote sensing image captioning. Int. J. Digit. Earth 2024, 17, 2392847. [Google Scholar] [CrossRef]
- Rao, J.; Wang, F.; Ding, L.; Qi, S.; Zhan, Y.; Liu, W.; Tao, D. Where Does the Performance Improvement Come From?—A Reproducibility Concern about Image-Text Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval—SIGIR’22, New York, NY, USA, 11–15 July 2022; pp. 2727–2737. [Google Scholar] [CrossRef]
- Wu, Y.; Wang, S.; Song, G.; Huang, Q. Learning Fragment Self-Attention Embeddings for Image-Text Matching. In Proceedings of the 27th ACM International Conference on Multimedia—MM’19, New York, NY, USA, 21–25 October 2019; pp. 2088–2096. [Google Scholar] [CrossRef]
- Qu, L.; Liu, M.; Cao, D.; Nie, L.; Tian, Q. Context-Aware Multi-View Summarization Network for Image-Text Matching. In Proceedings of the 28th ACM International Conference on Multimedia—MM’20, Seattle, WA, USA, 12–16 October 2020; pp. 1047–1055. [Google Scholar] [CrossRef]
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; Wang, C. Learning the Best Pooling Strategy for Visual Semantic Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 15789–15798. [Google Scholar]
- Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. arXiv 2018, arXiv:1707.05612. [Google Scholar] [CrossRef]
- Liu, M.; Qi, M.; Zhan, Z.; Qu, L.; Nie, X.; Nie, L. A Survey on Deep Learning Based Image-Text Matching. Chin. J. Comput. 2023, 46, 2370–2399. [Google Scholar]
- Wang, J.; Zhou, F.; Wen, S.; Liu, X.; Lin, Y. Deep Metric Learning with Angular Loss. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Wang, L.; Li, Y.; Lazebnik, S. Learning Deep Structure-Preserving Image-Text Embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Wang, L.; Li, Y.; Huang, J.; Lazebnik, S. Learning Two-Branch Neural Networks for Image-Text Matching Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 394–407. [Google Scholar] [CrossRef] [PubMed]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
- Biten, A.F.; Mafla, A.; Gómez, L.; Karatzas, D. Is an Image Worth Five Sentences? A New Look Into Semantics for Image-Text Matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1391–1400. [Google Scholar]
- Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked Cross Attention for Image-Text Matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; Han, J. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Cheng, Y.; Zhu, X.; Qian, J.; Wen, F.; Liu, P. Cross-modal Graph Matching Network for Image-text Retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–23. [Google Scholar] [CrossRef]
- Zhang, K.; Mao, Z.; Wang, Q.; Zhang, Y. Negative-Aware Attention Framework for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15661–15670. [Google Scholar]
- Wang, J.H.; Norouzi, M.; Tsai, S.M. Augmenting Multimodal Content Representation with Transformers for Misinformation Detection. Big Data Cogn. Comput. 2024, 8, 134. [Google Scholar] [CrossRef]
- Nam, H.; Ha, J.W.; Kim, J. Dual Attention Networks for Multimodal Reasoning and Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Seidenschwarz, J.D.; Elezi, I.; Leal-Taixé, L. Learning Intra-Batch Connections for Deep Metric Learning. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research. Meila, M., Zhang, T., Eds.; Volume 139, pp. 9410–9421. [Google Scholar]
- Kaya, M.; Bilge, H.Ş. Deep Metric Learning: A Survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer Nature: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Messina, N.; Amato, G.; Esuli, A.; Falchi, F.; Gennaro, C.; Marchand-Maillet, S. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–23. [Google Scholar] [CrossRef]
- Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Image-Text Embedding Learning via Visual and Textual Semantic Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 641–656. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-Aware Attention Network for Image-Text Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Wei, J.; Xu, X.; Yang, Y.; Ji, Y.; Wang, Z.; Shen, H.T. Universal Weighting Metric Learning for Cross-Modal Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Pei, J.; Zhong, K.; Yu, Z.; Wang, L.; Lakshmanna, K. Scene Graph Semantic Inference for Image and Text Matching. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–23. [Google Scholar] [CrossRef]
- Zhou, H.; Geng, Y.; Zhao, J.; Ma, X. Semantic-Enhanced Attention Network for Image-Text Matching. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 1256–1261. [Google Scholar] [CrossRef]
- Liu, C.; Mao, Z.; Liu, A.-A.; Zhang, T.; Wang, B.; Zhang, Y. Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), Nice, France, 21–25 October 2019; pp. 3–11. [Google Scholar] [CrossRef]
- Wang, Z.; Liu, X.; Li, H.; Sheng, L.; Yan, J.; Wang, X.; Shao, J. CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5764–5773. Available online: https://openaccess.thecvf.com/content_ICCV_2019/html/Wang_CAMP_Cross-Modal_Adaptive_Message_Passing_for_Text-Image_Retrieval_ICCV_2019_paper.html (accessed on 3 July 2024).
**COCO 5K Test**

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | rSum |
|---|---|---|---|---|---|---|---|
| VSE++ [13] | 41.3 | 71.1 | 81.2 | 30.3 | 59.4 | 72.4 | 355.7 |
| SCAN [20] | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 | 410.9 |
| VSRN [2] | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 | 415.7 |
| IMRAM [21] | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 | 416.5 |
| CAAN [35] | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 | 421.1 |
| CGMN [22] | 53.4 | 81.3 | 89.6 | 41.2 | 71.9 | 82.4 | 419.8 |
| NAAF [23] | 58.9 | 84.1 | 85.8 | 42.5 | 70.9 | 81.4 | 430.9 |
| VSRN++ [34] | 54.7 | 82.9 | 90.9 | 42.0 | 72.2 | 82.7 | 425.4 |
| HREM [3] | 57.7 | 82.5 | 89.9 | 42.6 | 72.3 | 82.7 | 427.7 |
| EISIN | 59.6 | 84.4 | 91.2 | 42.0 | 72.3 | 82.5 | 432.1 |

**COCO 5-Fold 1K Test**

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | rSum |
|---|---|---|---|---|---|---|---|
| VSE++ [13] | 64.6 | 90.0 | 95.7 | 52.0 | 84.3 | 92.0 | 478.6 |
| SCAN [20] | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8 | 507.9 |
| VSRN [2] | 76.2 | 94.8 | 98.2 | 62.8 | 89.7 | 95.1 | 516.8 |
| IMRAM [21] | 76.7 | 95.6 | 98.5 | 61.7 | 89.1 | 95.0 | 516.6 |
| CAMERA [9] | 77.5 | 96.3 | 98.8 | 63.4 | 90.9 | 95.8 | 522.7 |
| MPL [36] | 71.1 | 93.7 | 98.2 | 56.8 | 86.7 | 93.0 | 499.5 |
| NAAF [23] | 78.1 | 96.1 | 98.6 | 63.5 | 89.6 | 95.3 | 521.2 |
| SGSIN [37] | 76.7 | 96.5 | 99.1 | 61.7 | 89.6 | 95.3 | 518.9 |
| SEAM [38] | 77.9 | 95.6 | 98.3 | 64.2 | 91.2 | 96.4 | 523.6 |
| VSRN++ [34] | 77.9 | 96.0 | 98.5 | 64.1 | 91.0 | 96.1 | 523.6 |
| HREM [3] | 78.2 | 95.3 | 98.2 | 64.4 | 90.9 | 95.9 | 522.9 |
| EISIN | 79.3 | 95.9 | 98.5 | 63.9 | 90.8 | 95.8 | 524.1 |
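The recall metrics in these tables follow the standard retrieval protocol: R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved items, and rSum is the sum of R@1, R@5, and R@10 over both retrieval directions. As a rough illustration (not the authors' evaluation code), the sketch below computes these scores from a precomputed similarity matrix, assuming for simplicity a one-to-one image–text pairing; the actual benchmarks pair each image with five captions, which the official protocol accounts for.

```python
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    """R@K: percentage of queries (rows) whose ground-truth item
    (assumed here to sit on the diagonal) ranks in the top-k."""
    order = np.argsort(-sims, axis=1)            # columns sorted best-first
    gt = np.arange(sims.shape[0])[:, None]       # diagonal ground truth
    return 100.0 * (order[:, :k] == gt).any(axis=1).mean()

def rsum(img_to_txt: np.ndarray, txt_to_img: np.ndarray) -> float:
    """rSum: R@1 + R@5 + R@10 summed over both retrieval directions."""
    return sum(recall_at_k(s, k)
               for s in (img_to_txt, txt_to_img)
               for k in (1, 5, 10))
```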
**Flickr30K 1K Test**

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | rSum |
|---|---|---|---|---|---|---|---|
| VSE++ [13] | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | 409.8 |
| SCAN [20] | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 465.0 |
| VSRN [2] | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 482.6 |
| IMRAM [21] | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 484.2 |
| CAMERA [9] | 78.0 | 95.1 | 97.9 | 60.3 | 85.9 | 91.7 | 508.9 |
| MPL [36] | 69.4 | 89.9 | 95.4 | 47.5 | 75.5 | 83.1 | 460.8 |
| NAAF [23] | 79.6 | 96.3 | 98.3 | 59.3 | 83.9 | 90.2 | 507.6 |
| SGSIN [37] | 73.1 | 93.6 | 96.8 | 53.9 | 80.1 | 87.2 | 484.7 |
| SEAM [38] | 79.1 | 94.2 | 98.7 | 61.8 | 86.5 | 90.6 | 510.9 |
| VSRN++ [34] | 79.2 | 94.6 | 97.5 | 60.6 | 85.6 | 91.4 | 508.9 |
| HREM [3] | 83.3 | 96.1 | 98.4 | 62.2 | 86.4 | 91.8 | 518.2 |
| EISIN | 83.3 | 96.2 | 98.3 | 63.6 | 87.3 | 92.0 | 520.7 |
**Encoding and Matching Time**

| Method | Flickr30K Encoding | Flickr30K Matching | MS-COCO Encoding | MS-COCO Matching |
|---|---|---|---|---|
| SCAN [20] | 9.7 s | 599.0 s | 44.6 s | 2746.4 s |
| BFAN [39] | 12.9 s | 1158.4 s | 58.7 s | 5744.2 s |
| CAMP [40] | 4.3 s | 1291.5 s | 19.9 s | 6523.9 s |
| MPL [36] | 10.2 s | 648.7 s | 46.3 s | 3021.0 s |
| IMRAM [21] | 9.8 s | 680.5 s | 47.7 s | 3417.4 s |
| VSRN [2] | 16.7 s | 4.7 s | 74.3 s | 21.6 s |
| EISIN | 20.1 s | 4.9 s | 94.3 s | 20.9 s |
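The timing pattern above reflects a structural difference rather than an implementation detail: cross-attention methods such as SCAN, BFAN, and IMRAM must run an interaction module for every candidate image–text pair at query time, whereas embedding-based methods such as VSRN and EISIN encode each item once and reduce matching to a single matrix product. Below is a minimal sketch of that matching step with hypothetical shapes and random placeholder embeddings, not the authors' code.

```python
import numpy as np

# Hypothetical precomputed joint-space embeddings: 1000 test images and
# their 5000 captions, each already encoded once by the network.
img_emb = np.random.randn(1000, 1024).astype(np.float32)
txt_emb = np.random.randn(5000, 1024).astype(np.float32)

# L2-normalize so the inner product equals cosine similarity.
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

# One matrix product scores every image-text pair; this is why matching
# takes seconds for embedding-based methods in the table above.
sims = img_emb @ txt_emb.T          # shape (1000, 5000)
```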
**Training Settings and Parameter Counts**

| Method | Flickr30K Epochs | Flickr30K Batch Size | Flickr30K LR/Decay Epoch | MS-COCO Epochs | MS-COCO Batch Size | MS-COCO LR/Decay Epoch | Params |
|---|---|---|---|---|---|---|---|
| VSE++ [13] | 30 | 128 | 0.0002/15 | 30 | 128 | 0.0002/15 | 67 M |
| SCAN [20] | 30 | 128 | 0.0002/15 | 20 | 128 | 0.0005/10 | 9 M |
| VSRN [2] | 30 | 128 | 0.0002/15 | 30 | 128 | 0.0002/15 | 140 M |
| CAMERA [9] | 30 | 128 | 0.0001/10 | 40 | 128 | 0.0001/20 | 156 M |
| SEAM [38] | 30 | 64 | 0.0001/10 | 30 | 128 | 0.0002/10 | 114 M |
| HREM [3] | 30 | 128 | 0.0002/15 | 30 | 128 | 0.0002/15 | 131 M |
| EISIN | 30 | 128 | 0.0002/15 | 30 | 128 | 0.0002/15 | 126 M |
**Ablation on COCO 5K Test**

| Method | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | rSum |
|---|---|---|---|---|---|---|---|
| w/o AL | 57.5 | 82.4 | 89.2 | 42.6 | 72.2 | 82.7 | 427.1 |
| w/o SEL | 58.7 | 83.9 | 90.7 | 41.7 | 72.0 | 82.3 | 429.4 |
| w/o SEL and AL | 56.8 | 82.3 | 89.6 | 42.5 | 72.9 | 82.5 | 426.7 |
| EISIN | 59.6 | 84.4 | 91.2 | 42.0 | 72.3 | 82.5 | 432.1 |