Article

VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning

Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 1169; https://doi.org/10.3390/app14031169
Submission received: 1 January 2024 / Revised: 22 January 2024 / Accepted: 29 January 2024 / Published: 30 January 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract

Complex real-world tasks, such as visual question answering (VQA), involve models of different modalities. However, traditional multimodal learning requires large amounts of aligned data, such as image–text pairs, and constructing such large-scale training data is a challenge for multimodal learning. We therefore propose VL-Few, a simple and effective method for the multimodal few-shot problem. VL-Few (1) proposes modal alignment, which maps visual features into the language space through a lightweight network and improves the model's multimodal understanding; (2) adopts few-shot meta learning for the multimodal problem, constructing a pool of few-shot meta tasks to improve the model's generalization; (3) proposes semantic alignment to enhance the model's semantic understanding of the task, context, and demonstrations; (4) proposes task alignment, which formats the training data into the target task form and improves the model's task understanding; (5) proposes generation alignment, which adopts token-level training and a multitask fusion loss to improve the model's generation ability. Our experimental results demonstrate the effectiveness of VL-Few on multimodal few-shot problems.
Keywords: vision language learning; representation alignment; multimodal learning; meta learning; few-shot learning; visual question answering
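
The modal alignment described in the abstract, projecting visual features into the language space through a lightweight network, can be illustrated with a minimal sketch. The module name, dimensions, and two-layer MLP design below are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of the modal-alignment idea: a lightweight network projects
# frozen visual-encoder features into the language model's embedding space
# so they can be consumed alongside ordinary text tokens.
# All names and dimensions here are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps pooled visual features to a short prefix of language-space tokens."""
    def __init__(self, vis_dim: int = 768, lang_dim: int = 768, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        # Lightweight two-layer MLP producing n_tokens "visual prefix" embeddings.
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, lang_dim * n_tokens),
            nn.GELU(),
            nn.Linear(lang_dim * n_tokens, lang_dim * n_tokens),
        )

    def forward(self, vis_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (batch, vis_dim) pooled image features from a frozen encoder.
        batch = vis_feat.size(0)
        prefix = self.proj(vis_feat).view(batch, self.n_tokens, -1)
        return prefix  # (batch, n_tokens, lang_dim)

# Usage: prepend the projected visual tokens to the embedded question tokens
# before feeding them to the language model.
vis_feat = torch.randn(2, 768)       # e.g., pooled ViT features
text_emb = torch.randn(2, 16, 768)   # embedded question tokens
lm_input = torch.cat([VisualProjector()(vis_feat), text_emb], dim=1)
```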

Share and Cite

MDPI and ACS Style

Ma, H.; Fan, B.; Ng, B.K.; Lam, C.-T. VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning. Appl. Sci. 2024, 14, 1169. https://doi.org/10.3390/app14031169


