Soft Contrastive Cross-Modal Retrieval
Abstract
1. Introduction
- We propose a novel end-to-end cross-modal retrieval model, termed Soft Contrastive Cross-Modal Retrieval (SCCMR), which combines soft contrastive learning with cross-modal learning. To the best of our knowledge, this is the first work to integrate soft contrastive learning into cross-modal retrieval to improve multimodal feature embedding.
- To address the sharp, tortuous margins that hard contrastive objectives impose on the feature embedding, we apply soft contrastive learning to common-subspace learning, which helps the model find smoother embedding boundaries (see the sketch after this list). To cope with noisy data, we adopt smooth-label cross-modal learning, which improves the robustness of the model on real-world multimedia data.
- We carry out extensive experiments on three benchmark multimedia datasets: Wikipedia, NUS-WIDE, and Pascal Sentence. The results show that our method outperforms current state-of-the-art cross-modal retrieval methods, demonstrating its effectiveness.
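As a concrete illustration of the two ideas above, the following is a minimal sketch (in PyTorch) of a symmetric image-text contrastive loss whose one-hot targets are softened by a label-smoothing weight, so that the decision boundary between matched and unmatched pairs is less sharp. The function name, the temperature `tau`, and the smoothing weight `epsilon` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_emb, txt_emb, tau=0.1, epsilon=0.1):
    """Label-smoothed, symmetric image-text contrastive loss (illustrative sketch).

    img_emb, txt_emb: (N, d) embeddings of matched pairs, where row i of each
    tensor comes from the same image-text pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Temperature-scaled cosine similarities between every image and every text.
    logits = img_emb @ txt_emb.t() / tau                       # (N, N)

    n = img_emb.size(0)
    # Soften the usual one-hot targets: the true pair keeps most of the mass,
    # and the remainder is spread uniformly over the other candidates.
    targets = torch.full((n, n), epsilon / max(n - 1, 1), device=logits.device)
    targets.fill_diagonal_(1.0 - epsilon)

    # Symmetric cross-entropy against the soft targets (image->text and text->image).
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, setting `epsilon = 0` recovers the standard hard-target InfoNCE-style contrastive loss.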
2. Related Work
2.1. Cross-Modal Retrieval
2.2. Contrastive Learning
3. Methodology
3.1. Problem Definition
| Notation | Definition |
|---|---|
|  | a multimedia dataset |
|  | the i-th image sample |
|  | the i-th text sample |
|  | the semantic label vector of the i-th image–text pair |
| Q | a cross-modal query |
|  | the set of results |
| N | the number of samples in a batch |
|  | the multimodal semantic similarity function |
|  | the mapping function of the visual network |
|  | the mapping function of the textual network |
|  | the parameters of the visual network |
|  | the parameters of the textual network |
|  | hyperparameter, temperature |
|  | hyperparameter, smooth parameter |
|  | the prediction value of the label classifier |
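To make the notation above concrete, the snippet below sketches one plausible instantiation of the visual and textual mapping functions and of the multimodal semantic similarity function: two small projection heads over pre-extracted features, followed by cosine similarity in the common subspace. The layer sizes and input dimensions are assumptions for illustration only; the actual networks are those defined in Sections 3.2 and 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps one modality's features into the common subspace (illustrative sizes)."""
    def __init__(self, in_dim, hidden_dim=1024, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Visual and textual mapping functions (input dimensions are assumptions).
f_v = ProjectionHead(in_dim=4096)   # e.g., pre-extracted CNN image features
f_t = ProjectionHead(in_dim=3000)   # e.g., bag-of-words text features

def similarity(img_feat, txt_feat):
    """Multimodal semantic similarity: cosine similarity in the common subspace."""
    v = F.normalize(f_v(img_feat), dim=-1)
    t = F.normalize(f_t(txt_feat), dim=-1)
    return (v * t).sum(dim=-1)
```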
3.2. The Overview of SCCMR
3.3. Cross-Modal Feature Learning
3.4. Soft Contrastive Learning
3.5. Smooth Label Cross-Modal Learning
3.6. Optimization
Algorithm 1: Pseudocode for optimizing our SCCMR
4. Experiment
4.1. Setting
4.1.1. Datasets
4.1.2. Baselines
- CCA is a multivariate statistical method for analyzing the correlation between two sets of variables; in cross-modal retrieval it is often used to learn correlations between different data modalities such as images and text (a minimal implementation sketch is given after this list).
- SM uses machine learning technology to represent cross-modal data as semantic vectors, facilitating retrieval and matching based on semantic content.
- MCCA is a statistical approach designed to explore relationships between multiple datasets, with the goal of finding linear associations that reveal shared structures or patterns.
- MvDA learns a discriminative common space across views by minimizing the scatter among views of the same object or class while maximizing the scatter between different objects or classes.
- MvDA-VC is an extension of MvDA, which aims to maintain consistency across different views in the low-dimensional embedding space, thereby more effectively capturing the correlation and structural information between views.
- JRL learns a joint representation across modalities with sparse and semi-supervised regularization, improving information sharing and complementarity between modalities and thereby enhancing generalization.
- Deep-SM employs deep learning for semantic matching, using neural networks to embed data from different modalities, such as text and images, into a space where cross-modal semantic similarity can be computed.
- CMDN uses two multi-modal neural networks to embed and represent data, performing matching in a low-dimensional semantic space for cross-modal semantic similarity assessment.
- DCCAE combines canonical correlation objectives with autoencoders: each modality is encoded into a low-dimensional representation and then reconstructed, so the model learns a compact representation per modality while keeping the two representations correlated.
- ACMR applies adversarial learning, inspired by Generative Adversarial Networks (GANs), to learn modality-invariant feature representations, refining them through adversarial training between a feature projector and a modality classifier.
- MG-HSF combines hierarchical semantic fusion with cross-modal adversarial learning to capture both fine-grained and coarse-grained semantic knowledge, generating modality-invariant representations in a common subspace.
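For reference, the snippet below is a minimal sketch of how a CCA baseline like the one described at the top of this list is commonly implemented with scikit-learn: fit CCA on paired image and text features, project both modalities into the shared canonical space, and rank gallery items by cosine similarity. The feature dimensions, number of components, and random data are placeholders, not the settings used in the experiments.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy paired features (dimensions are placeholders, not the paper's settings).
rng = np.random.default_rng(0)
img_feats = rng.standard_normal((300, 128))   # image features, one row per pair
txt_feats = rng.standard_normal((300, 64))    # text features, same pair order

# Learn a shared canonical space from the paired training data.
cca = CCA(n_components=10, max_iter=500)
img_c, txt_c = cca.fit_transform(img_feats, txt_feats)

def retrieve(query_vec, gallery, top_k=5):
    """Rank gallery items by cosine similarity to the projected query."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:top_k]

# Text-to-image retrieval: project a text query and rank the projected images.
top_images = retrieve(txt_c[0], img_c)
```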
4.1.3. Evaluation Metrics
4.1.4. Implementation Details
4.2. Performance Evaluation
4.2.1. Compared with State of the Art
4.2.2. Hyperparameter Sensitivity Analysis
4.2.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. SCL Loss Is Smoother than CL Loss
References
- Zhen, L.; Hu, P.; Wang, X.; Peng, D. Deep supervised cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10394–10403. [Google Scholar]
- Cao, Y.; Long, M.; Wang, J.; Yang, Q.; Yu, P.S. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1445–1454. [Google Scholar]
- Costa Pereira, J.; Coviello, E.; Doyle, G.; Rasiwasia, N.; Lanckriet, G.; Levy, R.; Vasconcelos, N. On the role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval. Trans. Pattern Anal. Mach. Intell. 2014, 36, 521–535. [Google Scholar] [CrossRef]
- Chen, Y.; Yuan, J.; Tian, Y.; Geng, S.; Li, X.; Zhou, D.; Metaxas, D.N.; Yang, H. Revisiting multimodal representation in contrastive learning: From patch and token embeddings to finite discrete tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15095–15104. [Google Scholar]
- Lin, Z.; Bas, E.; Singh, K.Y.; Swaminathan, G.; Bhotika, R. Relaxing contrastiveness in multimodal representation learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2227–2236. [Google Scholar]
- Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132. [Google Scholar] [CrossRef]
- Fan, Y.; Xu, W.; Wang, H.; Wang, J.; Guo, S. PMR: Prototypical Modal Rebalance for Multimodal Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 20029–20038. [Google Scholar]
- Jin, P.; Huang, J.; Xiong, P.; Tian, S.; Liu, C.; Ji, X.; Yuan, L.; Chen, J. Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2472–2482. [Google Scholar]
- Rocco, I.; Arandjelović, R.; Sivic, J. End-to-End Weakly-Supervised Semantic Alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Huang, S.; Wang, Q.; Zhang, S.; Yan, S.; He, X. Dynamic context correspondence network for semantic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2010–2019. [Google Scholar]
- Hao, F.; He, F.; Cheng, J.; Wang, L.; Cao, J.; Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8460–8469. [Google Scholar]
- Liu, Z.; Zhu, X.; Hu, G.; Guo, H.; Tang, M.; Lei, Z.; Robertson, N.M.; Wang, J. Semantic alignment: Finding semantically consistent ground-truth for facial landmark detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3467–3476. [Google Scholar]
- Hotelling, H. Relations Between Two Sets of Variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
- Rupnik, J.; Shawe-Taylor, J. Multi-view canonical correlation analysis. In Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD 2010), Ljubljana, Slovenia, 12 October 2010; pp. 1–4. [Google Scholar]
- Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Marchetti, G.L.; Tegnér, G.; Varava, A.; Kragic, D. Equivariant representation learning via class-pose decomposition. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; pp. 4745–4756. [Google Scholar]
- Tao, C.; Zhu, X.; Su, W.; Huang, G.; Li, B.; Zhou, J.; Qiao, Y.; Wang, X.; Dai, J. Siamese image modeling for self-supervised vision representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2132–2141. [Google Scholar]
- Li, T.; Chang, H.; Mishra, S.; Zhang, H.; Katabi, D.; Krishnan, D. Mage: Masked generative encoder to unify representation learning and image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2142–2152. [Google Scholar]
- Morioka, H.; Hyvarinen, A. Connectivity-contrastive learning: Combining causal discovery and representation learning for multimodal data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; pp. 3399–3426. [Google Scholar]
- Cai, S.; Wang, Z.; Ma, X.; Liu, A.; Liang, Y. Open-world multi-task control through goal-aware representation learning and adaptive horizon prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13734–13744. [Google Scholar]
- Zhu, L.; Song, J.; Zhu, X.; Zhang, C.; Zhang, S.; Yuan, X. Adversarial learning-based semantic correlation representation for cross-modal retrieval. IEEE MultiMedia 2020, 27, 79–90. [Google Scholar] [CrossRef]
- Zhu, L.; Zhang, C.; Song, J.; Liu, L.; Zhang, S.; Li, Y. Multi-graph based hierarchical semantic fusion for cross-modal representation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
- Jiang, D.; Ye, M. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2787–2797. [Google Scholar]
- Liu, Y.; Li, G.; Lin, L. Cross-modal causal relational reasoning for event-level visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11624–11641. [Google Scholar] [CrossRef]
- Chun, S.; Oh, S.J.; De Rezende, R.S.; Kalantidis, Y.; Larlus, D. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8415–8424. [Google Scholar]
- Song, Y.; Soleymani, M. Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1979–1988. [Google Scholar]
- Kan, M.; Shan, S.; Zhang, H.; Lao, S.; Chen, X. Multi-View Discriminant Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 188–194. [Google Scholar] [CrossRef]
- Zhai, X.; Peng, Y.; Xiao, J. Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Trans. Circuits Syst. Video Technol. 2013, 24, 965–978. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Zhu, L.; Zhang, C.; Song, J.; Zhang, S.; Tian, C.; Zhu, X. Deep multigraph hierarchical enhanced semantic representation for cross-modal retrieval. IEEE MultiMedia 2022, 29, 17–26. [Google Scholar] [CrossRef]
- Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; Shen, H.T. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 19–21 October 2017; pp. 154–162. [Google Scholar]
- Wei, Y.; Zhao, Y.; Lu, C.; Wei, S.; Liu, L.; Zhu, Z.; Yan, S. Cross-modal retrieval with CNN visual features: A new baseline. IEEE Trans. Cybern. 2016, 47, 449–460. [Google Scholar] [CrossRef]
- Peng, Y.; Huang, X.; Qi, J. Cross-media shared representation by hierarchical learning with multiple deep networks. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016; Volume 3846, p. 3853. [Google Scholar]
- Andrew, G.; Arora, R.; Bilmes, J.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1247–1255. [Google Scholar]
- Wang, W.; Arora, R.; Livescu, K.; Bilmes, J. On deep multi-view representation learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1083–1092. [Google Scholar]
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
- Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive representation learning: A framework and review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
- Zang, Z.; Shang, L.; Yang, S.; Wang, F.; Sun, B.; Xie, X.; Li, S.Z. Boosting Novel Category Discovery Over Domains with Soft Contrastive Learning and All in One Classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 11858–11867. [Google Scholar]
- Wang, X.; Han, X.; Huang, W.; Dong, D.; Scott, M.R. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5022–5030. [Google Scholar]
- Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–10 January 2024; pp. 958–979. [Google Scholar]
- Sarukkai, V.; Li, L.; Ma, A.; Ré, C.; Fatahalian, K. Collage diffusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–10 January 2024; pp. 4208–4217. [Google Scholar]
- Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. Hcmsl: Hybrid cross-modal similarity learning for cross-modal retrieval. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–22. [Google Scholar] [CrossRef]
- Nie, Y.; Chen, H.; Bansal, M. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6859–6866. [Google Scholar]
- Barlow, H.B. Unsupervised learning. Neural Comput. 1989, 1, 295–311. [Google Scholar] [CrossRef]
- Dy, J.G.; Brodley, C.E. Feature selection for unsupervised learning. J. Mach. Learn. Res. 2004, 5, 845–889. [Google Scholar]
- Zolfaghari, M.; Zhu, Y.; Gehler, P.; Brox, T. Crossclr: Cross-modal contrastive learning for multi-modal video representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1450–1459. [Google Scholar]
- Afham, M.; Dissanayake, I.; Dissanayake, D.; Dharmasiri, A.; Thilakarathna, K.; Rodrigo, R. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9902–9912. [Google Scholar]
- Kim, D.; Tsai, Y.H.; Zhuang, B.; Yu, X.; Sclaroff, S.; Saenko, K.; Chandraker, M. Learning cross-modal contrastive features for video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13618–13627. [Google Scholar]
- Feng, C.; Patras, I. Adaptive soft contrastive learning. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2721–2727. [Google Scholar]
- Zhang, H.; Koh, J.Y.; Baldridge, J.; Lee, H.; Yang, Y. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 833–842. [Google Scholar]
- Liu, Z.; Xiong, C.; Lv, Y.; Liu, Z.; Yu, G. Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2022. [Google Scholar]
- Huang, Z.; Niu, G.; Liu, X.; Ding, W.; Xiao, X.; Wu, H.; Peng, X. Learning with noisy correspondence for cross-modal matching. Adv. Neural Inf. Process. Syst. 2021, 34, 29406–29419. [Google Scholar]
- Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with noisy labels. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
- Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; Isola, P. What makes for good views for contrastive learning? Adv. Neural Inf. Process. Syst. 2020, 33, 6827–6839. [Google Scholar]
- Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G.R.; Levy, R.; Vasconcelos, N. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 251–260. [Google Scholar]
- Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, Santorini Island, Greece, 8–10 July 2009; p. 48. [Google Scholar]
- Rashtchian, C.; Young, P.; Hodosh, M.; Hockenmaier, J. Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Los Angeles, CA, USA, 6 June 2010; pp. 139–147. [Google Scholar]
| Dataset | Train Set | Test Set | Labels | Image Feature | Text Feature |
|---|---|---|---|---|---|
| Wikipedia | 1300 | 1566 | 10 | 4096 | 3000 |
| NUS-WIDE | 8000 | 1000 | 20 | 4096 | 1000 |
| Pascal Sentence | 800 | 100 | 20 | 4096 | 300 |
| Type | Method | Text2Image | Image2Text | Average |
|---|---|---|---|---|
| Traditional Method | CCA [13] | 17.84 | 21.01 | 19.43 |
|  | SM [59] | 28.51 | 23.34 | 25.93 |
|  | MCCA [14] | 30.70 | 34.10 | 32.40 |
|  | MvDA [28] | 30.80 | 33.70 | 32.30 |
|  | MvDA-VC [28] | 35.80 | 38.80 | 37.30 |
|  | JRL [29] | 41.80 | 44.90 | 43.40 |
| Deep Learning Method | Deep-SM [33] | 35.43 | 39.90 | 37.67 |
|  | CMDN [34] | 42.70 | 48.70 | 45.70 |
|  | DCCA [35] | 39.60 | 44.40 | 42.00 |
|  | DCCAE [36] | 38.50 | 43.50 | 41.00 |
|  | ACMR [32] | 43.42 | 47.74 | 45.58 |
|  | MG-HSF [31] | 53.21 | 52.85 | 53.03 |
|  | Ours | 61.90 | 49.23 | 55.57 |
| Type | Method | Text2Image | Image2Text | Average |
|---|---|---|---|---|
| Traditional Method | CCA [13] | 36.80 | 38.17 | 37.49 |
|  | SM [59] | 42.37 | 39.16 | 40.77 |
|  | MCCA [14] | 46.20 | 44.80 | 45.50 |
|  | MvDA [28] | 52.60 | 50.10 | 51.30 |
|  | MvDA-VC [28] | 55.70 | 52.60 | 54.20 |
|  | JRL [29] | 59.80 | 58.60 | 59.20 |
| Deep Learning Method | Deep-SM [33] | 62.55 | 57.80 | 60.18 |
|  | CMDN [34] | 51.50 | 49.20 | 50.40 |
|  | DCCA [35] | 54.90 | 53.20 | 54.00 |
|  | DCCAE [36] | 54.00 | 51.11 | 52.50 |
|  | ACMR [32] | 57.85 | 58.41 | 58.13 |
|  | MG-HSF [31] | 64.88 | 62.06 | 63.47 |
|  | Ours | 69.45 | 67.97 | 68.71 |
| Type | Method | Text2Image | Image2Text | Average |
|---|---|---|---|---|
| Traditional Method | CCA [13] | 22.70 | 22.50 | 22.60 |
|  | SM [59] | 21.12 | 18.74 | 20.14 |
|  | MCCA [14] | 68.90 | 66.40 | 67.70 |
|  | MvDA [28] | 62.60 | 59.40 | 61.00 |
|  | MvDA-VC [28] | 67.30 | 64.80 | 66.10 |
|  | JRL [29] | 53.40 | 52.70 | 53.10 |
| Deep Learning Method | Deep-SM [33] | 48.05 | 44.63 | 46.34 |
|  | CMDN [34] | 52.60 | 54.40 | 53.50 |
|  | DCCA [35] | 67.70 | 67.80 | 67.80 |
|  | DCCAE [36] | 67.10 | 68.00 | 67.50 |
|  | ACMR [32] | 67.60 | 67.10 | 67.35 |
|  | MG-HSF [31] | 71.55 | 69.62 | 70.59 |
|  | Ours | 69.55 | 68.33 | 68.94 |
| Method | | Text2Image | Image2Text | Average |
|---|---|---|---|---|
| √ |  | 69.52 | 67.65 | 68.58 |
|  | √ | 63.86 | 62.78 | 63.32 |
| √ | √ | 69.45 | 67.97 | 68.71 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).