CoCM: Conditional Cross-Modal Learning for Vision-Language Models
Abstract
1. Introduction
- We propose CoCM, a novel cross-modal model that builds separate cache models for the image and text domains while integrating image features into text-based prompt learning. The resulting cross-modal cache model enables multimodal metric-based classification and significantly improves prediction accuracy (see the sketch following this list).
- The model extracts inference cues by integrating visual and textual information, dynamically adjusts the cross-modal fusion affinity ratio, and disentangles the similarity measures of the two modalities. It further improves performance by adaptively adjusting the learning intensity for hard samples, and it incorporates a similarity loss over images within a batch as a constraint to strengthen fine-grained classification.
- CoCM exhibits strong representation learning, achieving outstanding performance on 11 classification and recognition datasets, and its generalization ability is further demonstrated on four out-of-distribution benchmark datasets.
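To make the cache-model idea concrete, the following minimal PyTorch-style sketch shows how a cross-modal cache can fuse image-side and text-side affinities on top of zero-shot CLIP logits. All names (`cross_modal_cache_logits`, `fusion_ratio`, etc.) and the fixed fusion ratio are illustrative assumptions rather than the authors' implementation; the default coefficients simply mirror the best-performing settings reported in the ablation tables later in the article.

```python
import torch

def cross_modal_cache_logits(
    image_feat,          # (B, D)   L2-normalized CLIP image features of the test batch
    image_cache_keys,    # (N*K, D) cached few-shot image features
    text_cache_keys,     # (N*K, D) cached text/prompt features aligned to the same labels
    cache_values,        # (N*K, C) one-hot labels of the cached samples
    clip_text_weights,   # (D, C)   frozen CLIP text classifier (class prompt embeddings)
    fusion_ratio=0.7,    # assumed name: weight between image-side and text-side affinities
    residual_ratio=1.2,  # weight of the cache term relative to zero-shot CLIP
    sharpness_ratio=3.5, # temperature of the affinity kernel
):
    """Hedged sketch of a cross-modal cache classifier; not the authors' code."""
    # Zero-shot CLIP logits from the frozen text classifier.
    clip_logits = 100.0 * image_feat @ clip_text_weights                 # (B, C)

    # Affinities between test features and the two cache key sets.
    img_affinity = image_feat @ image_cache_keys.t()                     # (B, N*K)
    txt_affinity = image_feat @ text_cache_keys.t()                      # (B, N*K)

    # Fuse the two modalities' affinities. CoCM adjusts this ratio dynamically;
    # a fixed value is used here purely for illustration.
    affinity = fusion_ratio * img_affinity + (1.0 - fusion_ratio) * txt_affinity

    # Sharpened, non-negative kernel over cached labels, as in cache-based adapters.
    cache_logits = torch.exp(-sharpness_ratio * (1.0 - affinity)) @ cache_values  # (B, C)

    return clip_logits + residual_ratio * cache_logits
```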
2. Related Work
2.1. Vision-Language Model
2.2. Parameter-Efficient Transfer Learning
2.3. Cache Model
2.4. Hard Samples Mining
3. Methodology
3.1. Preliminaries
3.2. Image Cache Model Construction
3.3. Cross-Modal Cache Model Construction
3.4. Adaptive Scaling
3.5. Building Logits
3.6. Loss Function
3.7. Algorithm Description
Algorithm 1: CoCM Algorithm Description
4. Experiments
4.1. Experimental Setups
4.2. Cross Label Generalization
4.3. Domain Generalization
4.4. Model Complexity
4.5. Ablation Studies
4.5.1. Different Backbones
4.5.2. Different Hyper-Parameters
4.5.3. Different Residual and Sharpness Ratios
4.5.4. Analysis of Hard Example Mining
4.5.5. Analysis of Similarity Loss
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR. pp. 8748–8762. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR. pp. 4904–4916. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR. pp. 12888–12900. [Google Scholar]
- Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event/Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics. pp. 3045–3059. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR. pp. 2790–2799. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
- Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. 2023, 132, 581–595. [Google Scholar] [CrossRef]
- Zhang, R.; Zhang, W.; Fang, R.; Gao, P.; Li, K.; Dai, J.; Qiao, Y.; Li, H. Tip-adapter: Training-free adaption of clip for few-shot classification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 493–510. [Google Scholar]
- Kim, J.H.; Jun, J.; Zhang, B.T. Bilinear attention networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
- Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6281–6290. [Google Scholar]
- Gao, P.; Jiang, Z.; You, H.; Lu, P.; Hoi, S.C.; Wang, X.; Li, H. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6639–6648. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 4171–4186. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 5099–5110. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Chen, Y.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Learning UNiversal Image-TExt Representations. arXiv 2019, arXiv:1909.11740. [Google Scholar]
- Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825. [Google Scholar]
- Yu, T.; Lu, Z.; Jin, X.; Chen, Z.; Wang, X. Task residual for tuning vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10899–10909. [Google Scholar]
- Yang, J.; Li, Z.; Xie, S.; Zhu, W.; Yu, W.; Li, S. Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2024, Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
- Li, X.; Lian, D.; Lu, Z.; Bai, J.; Chen, Z.; Wang, X. Graphadapter: Tuning vision-language models with dual knowledge graph. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; Tang, J. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 61–68. [Google Scholar]
- Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
- Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
- Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXXIII. Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–727. [Google Scholar]
- Zhang, Y.; Zhou, K.; Liu, Z. Neural prompt search. arXiv 2022, arXiv:2206.04673. [Google Scholar] [CrossRef]
- Yang, X.; Cheng, W.; Zhao, X.; Petzold, L.; Chen, H. Dynamic Prompting: A Unified Framework for Prompt Tuning. arXiv 2023, arXiv:2303.02909. [Google Scholar]
- Zang, Y.; Li, W.; Zhou, K.; Huang, C.; Loy, C.C. Unified vision and language prompt learning. arXiv 2022, arXiv:2210.07225. [Google Scholar]
- Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19113–19122. [Google Scholar]
- Grave, E.; Cisse, M.M.; Joulin, A. Unbounded cache model for online language modeling with open vocabulary. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
- Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; PMLR. pp. 1126–1135. [Google Scholar]
- Robinson, J.D.; Chuang, C.; Sra, S.; Jegelka, S. Contrastive Learning with Hard Negative Samples. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Wu, T.; Ding, X.; Zhang, H.; Gao, J.; Tang, M.; Du, L.; Qin, B.; Liu, T. Discrimloss: A universal loss for hard samples and incorrect samples discrimination. IEEE Trans. Multimed. 2023, 26, 1957–1968. [Google Scholar] [CrossRef]
- Song, S.; Bae, H. Hard-negative Sampling with Cascaded Fine-Tuning Network to Boost Flare Removal Performance in the Nighttime Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2843–2852. [Google Scholar]
- Wang, K.; Peng, Y.; Huang, H.; Hu, Y.; Li, S. Mining hard samples locally and globally for improved speech separation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6037–6041. [Google Scholar]
- Liu, Z.; Li, S.; Wang, G.; Wu, L.; Tan, C.; Li, S.Z. Harnessing hard mixed samples with decoupled regularizer. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Fei-Fei, L.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004; p. 178. [Google Scholar]
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C. Cats and dogs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 July 2012; pp. 3498–3505. [Google Scholar]
- Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 554–561. [Google Scholar]
- Nilsback, M.E.; Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008; pp. 722–729. [Google Scholar]
- Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101–mining discriminative components with random forests. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 446–461. [Google Scholar]
- Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; Vedaldi, A. Fine-grained visual classification of aircraft. arXiv 2013, arXiv:1306.5151. [Google Scholar]
- Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492. [Google Scholar]
- Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3606–3613. [Google Scholar]
- Helber, P.; Bischke, B.; Dengel, A.; Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226. [Google Scholar] [CrossRef]
- Soomro, K.; Zamir, A.R.; Shah, M. A dataset of 101 human action classes from videos in the wild. Cent. Res. Comput. Vis. 2012, 2, 1–7. [Google Scholar]
- Recht, B.; Roelofs, R.; Schmidt, L.; Shankar, V. Do imagenet classifiers generalize to imagenet? In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR. pp. 5389–5400. [Google Scholar]
- Wang, H.; Ge, S.; Lipton, Z.; Xing, E.P. Learning robust global representations by penalizing local predictive power. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Hendrycks, D.; Zhao, K.; Basart, S.; Steinhardt, J.; Song, D. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15262–15271. [Google Scholar]
- Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8340–8349. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
Table: Domain generalization. Accuracy (%) on ImageNet (source) and four out-of-distribution variants (ImageNet-V2, -Sketch, -A, -R) across visual backbones; the last column averages the four target datasets.

| Method | Backbone | ImageNet (Source) | −V2 | −Sketch | −A | −R | Target Avg. |
|---|---|---|---|---|---|---|---|
| Zero-shot CLIP [1] | ResNet-50 | 58.18 | 51.34 | 33.32 | 21.65 | 56.00 | 40.58 |
| Linear Probe CLIP [1] | ResNet-50 | 55.87 | 45.97 | 19.07 | 12.74 | 28.16 | 26.49 |
| CoOp [8] | ResNet-50 | 62.95 | 55.11 | 32.74 | 22.12 | 54.96 | 41.23 |
| TaskRes [20] | ResNet-50 | 64.75 | 56.47 | 35.83 | 22.80 | 60.70 | 43.95 |
| GraphAdapter [22] | ResNet-50 | 65.70 | 56.40 | 34.50 | 21.88 | 58.94 | 42.93 |
| XMAdapter [21] | ResNet-50 | 66.22 | 56.51 | 36.72 | 23.46 | 61.53 | 44.56 |
| CoCM | ResNet-50 | 66.51 | 56.83 | 36.91 | 23.83 | 61.61 | 44.79 |
| Zero-shot CLIP [1] | ResNet-101 | 61.62 | 54.81 | 38.71 | 28.05 | 64.38 | 46.49 |
| Linear Probe CLIP [1] | ResNet-101 | 59.75 | 50.05 | 26.80 | 19.44 | 47.19 | 35.87 |
| CoOp [8] | ResNet-101 | 66.60 | 58.66 | 39.08 | 28.89 | 63.00 | 47.41 |
| TaskRes [20] | ResNet-101 | 67.70 | 59.50 | 41.70 | 29.87 | 68.07 | 49.79 |
| GraphAdapter [22] | ResNet-101 | 68.23 | 59.60 | 40.83 | 28.77 | 67.13 | 49.08 |
| XMAdapter [21] | ResNet-101 | 68.96 | 59.64 | 41.50 | 30.57 | 68.82 | 50.13 |
| CoCM | ResNet-101 | 69.23 | 59.87 | 41.63 | 30.83 | 68.97 | 50.33 |
| Zero-shot CLIP [1] | ViT-B/32 | 62.05 | 54.79 | 40.82 | 29.57 | 65.99 | 47.79 |
| Linear Probe CLIP [1] | ViT-B/32 | 59.58 | 49.73 | 28.06 | 19.67 | 47.20 | 36.17 |
| CoOp [8] | ViT-B/32 | 66.85 | 58.08 | 40.44 | 30.62 | 64.45 | 48.40 |
| TaskRes [20] | ViT-B/32 | 68.20 | 59.20 | 42.50 | 31.43 | 69.33 | 50.62 |
| GraphAdapter [22] | ViT-B/32 | 68.80 | 59.00 | 41.70 | 29.57 | 68.67 | 49.74 |
| XMAdapter [21] | ViT-B/32 | 69.56 | 59.12 | 42.91 | 31.95 | 69.57 | 50.89 |
| CoCM | ViT-B/32 | 69.72 | 59.45 | 42.85 | 32.32 | 69.78 | 51.10 |
| Zero-shot CLIP [1] | ViT-B/16 | 66.73 | 60.83 | 46.15 | 47.77 | 73.96 | 57.18 |
| Linear Probe CLIP [1] | ViT-B/16 | 65.85 | 56.26 | 34.77 | 35.68 | 58.43 | 46.29 |
| CoOp [8] | ViT-B/16 | 71.92 | 64.18 | 46.71 | 48.41 | 74.32 | 58.41 |
| TaskRes [20] | ViT-B/16 | 73.07 | 65.30 | 49.13 | 50.37 | 77.70 | 60.63 |
| GraphAdapter [22] | ViT-B/16 | 73.68 | 65.57 | 48.57 | 49.23 | 77.20 | 60.14 |
| XMAdapter [21] | ViT-B/16 | 74.43 | 65.54 | 49.58 | 50.69 | 77.95 | 60.94 |
| CoCM | ViT-B/16 | 74.58 | 65.91 | 49.77 | 50.83 | 77.87 | 61.10 |
Table: Model complexity and efficiency comparison on ImageNet; the accuracy column corresponds to the ResNet-50 results reported above.

| Models | Tunable Parameters (M) | GFLOPs | Training Time per Epoch (s) | Inference Time (s/100) | GPU Memory | ImageNet Acc. (%) |
|---|---|---|---|---|---|---|
| CoOp [8] | 0.008 | 1943.12 | 40.91 | 119.64 | 18.907 | 62.95 |
| CLIP-Adapter [9] | 0.524 | 1959.44 | 45.71 | 275.22 | 9.257 | 63.59 |
| Tip-Adapter [10] | 16.384 | 5.43 | 12.36 | 51.03 | 4.313 | 65.44 |
| TaskRes [20] | 1.024 | 5.42 | 13.64 | 4.89 | 6.227 | 64.75 |
| GraphAdapter [22] | 4.145 | 5.42 | 23.29 | 4.91 | 10.75 | 65.70 |
| XMAdapter [21] | 18.561 | 5.39 | 13.41 | 63.24 | 5.148 | 66.22 |
| CoCM | 17.724 | 5.41 | 14.66 | 72.16 | 4.823 | 66.51 |
Table: ImageNet accuracy (%) of different methods across visual backbones.

| Models | ResNet-50 | ResNet-101 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|---|
| Zero-shot CLIP [1] | 58.18 | 61.62 | 62.05 | 66.73 |
| CoOp [8] | 62.95 | 66.60 | 66.85 | 71.92 |
| CLIP-Adapter [9] | 63.59 | 65.39 | 66.19 | 71.13 |
| Tip-Adapter [10] | 65.44 | 68.56 | 68.65 | 73.69 |
| GraphAdapter [22] | 65.70 | 68.23 | 68.80 | 73.68 |
| XMAdapter [21] | 66.22 | 68.96 | 69.56 | 74.43 |
| CoCM | 66.51 | 69.17 | 69.73 | 74.61 |
Table: Accuracy (%) for different values of the swept hyper-parameter.

| Hyper-parameter value | 0 | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|
| Acc (%) | 62.95 | 64.82 | 65.43 | 65.81 | 66.51 | 65.73 | 65.44 |
Table: Accuracy (%) for different values of the residual ratio and the sharpness ratio.

| Residual ratio | 0.0 | 0.5 | 1.0 | 1.2 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|---|---|---|
| Acc (%) | 58.18 | 64.62 | 65.83 | 66.51 | 65.57 | 63.46 | 62.54 |

| Sharpness ratio | 0.5 | 1.5 | 3.5 | 5.5 | 7.5 | 9.5 | 11.5 |
|---|---|---|---|---|---|---|---|
| Acc (%) | 64.62 | 64.81 | 66.51 | 65.24 | 64.36 | 64.11 | 65.56 |
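For orientation, in cache-based adapters such as Tip-Adapter [10], on which this family of methods builds, the residual ratio and sharpness ratio swept above typically enter the prediction as in the hedged sketch below; CoCM's exact formulation may differ.

```python
import torch

def cache_prediction(clip_logits, affinity, cache_values, residual_ratio, sharpness_ratio):
    # Sharpness ratio: temperature of the affinity kernel (larger -> more peaked on close neighbors).
    cache_logits = torch.exp(-sharpness_ratio * (1.0 - affinity)) @ cache_values
    # Residual ratio: weight of the cache term relative to the zero-shot CLIP logits.
    return clip_logits + residual_ratio * cache_logits
```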
Table: Ablation of hard example mining; accuracy (%) on five datasets.

| Models | ImageNet | StanfordCars | FGVCAircraft | DTD | EuroSAT | Average |
|---|---|---|---|---|---|---|
| Zero-shot CLIP [1] | 58.18 | 55.61 | 17.28 | 42.32 | 37.56 | 42.19 |
| Tip-Adapter [10] | 65.44 | 75.75 | 35.86 | 66.94 | 84.94 | 65.79 |
| XMAdapter [21] | 66.22 | 76.85 | 37.69 | 68.25 | 86.13 | 67.02 |
| CoCM w/o hard example mining | 66.14 | 77.16 | 37.23 | 68.11 | 85.55 | 66.84 |
| CoCM | 66.51 | 77.35 | 37.94 | 68.56 | 86.43 | 67.36 |
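The hard-example handling ablated above is described only at a high level in the contributions (adaptive learning intensity for hard samples). The snippet below is a generic difficulty-weighted loss sketch of that idea, with all names assumed; it mirrors focal-style re-weighting rather than CoCM's exact rule.

```python
import torch
import torch.nn.functional as F

def difficulty_weighted_ce(logits, targets, gamma=1.0):
    """Sketch: up-weight hard (low-confidence) samples in the loss.

    `gamma` (assumed name) controls how strongly hard samples are emphasized;
    this illustrates adaptive learning intensity, not the authors' exact rule.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")           # per-sample loss
    with torch.no_grad():
        p_true = logits.softmax(dim=-1).gather(1, targets[:, None]).squeeze(1)
        weights = (1.0 - p_true) ** gamma                             # harder -> larger weight
    return (weights * ce).mean()
```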
Table: Ablation of the in-batch similarity loss; accuracy (%) on five fine-grained datasets.

| Models | Flowers102 | StanfordCars | FGVCAircraft | OxfordPets | Food101 | Average |
|---|---|---|---|---|---|---|
| Zero-shot CLIP [1] | 66.14 | 55.61 | 17.28 | 85.77 | 77.31 | 60.42 |
| Tip-Adapter [10] | 94.23 | 75.75 | 35.86 | 88.18 | 78.11 | 74.43 |
| XMAdapter [21] | 96.76 | 76.85 | 37.69 | 88.85 | 78.96 | 75.82 |
| CoCM w/o similarity loss | 96.81 | 77.12 | 37.78 | 88.86 | 79.23 | 75.96 |
| CoCM | 96.93 | 77.35 | 37.94 | 88.94 | 79.54 | 76.14 |
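The in-batch similarity constraint ablated above is likewise described only at a high level. One plausible realization, shown here purely as an assumed sketch, penalizes cosine similarity between image features of different classes within a batch:

```python
import torch

def batch_similarity_loss(image_feat, targets):
    """Hedged sketch of an in-batch similarity constraint (assumed form).

    Penalizes cosine similarity between features of different-class images,
    one way a batch-level similarity loss could sharpen fine-grained class
    boundaries; CoCM's exact loss may differ.
    """
    feat = torch.nn.functional.normalize(image_feat, dim=-1)    # (B, D)
    sim = feat @ feat.t()                                       # (B, B) cosine similarities
    diff_class = (targets[:, None] != targets[None, :]).float()
    # Average similarity over different-class pairs only.
    return (sim * diff_class).sum() / diff_class.sum().clamp(min=1.0)
```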
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).