Article

Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment

1 Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin 541004, China
2 School of Artificial Intelligence, Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 816; https://doi.org/10.3390/electronics14040816
Submission received: 7 January 2025 / Revised: 12 February 2025 / Accepted: 17 February 2025 / Published: 19 February 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Vision–language pre-training (VLP) faces challenges in aligning hierarchical textual semantics (words/phrases/sentences) with multi-scale visual features (objects/relations/global context). We propose a hierarchical VLP model (HieVLP) that addresses these challenges through semantic decomposition and progressive alignment. Textually, a semantic parser deconstructs captions into word-, phrase-, and sentence-level components, which are encoded via hierarchical BERT layers. Visually, a Swin Transformer extracts object-level (local), relation-level (mid-scale), and global-level features through shifted window hierarchies. During pre-training, a freezing strategy sequentially activates text layers (sentence→phrase→word), aligning each with the corresponding visual scale via contrastive and language modeling losses. Experimental evaluations demonstrate that HieVLP outperforms hierarchical baselines across a range of tasks, with performance improvements from approximately 3.2% to 11.2%. In image captioning, HieVLP achieves an average CIDEr improvement of around 7.2% and a 2.1% improvement in SPICE. For image–text retrieval, it achieves recall increases of 4.7–6.8%. In reasoning tasks, HieVLP boosts accuracy by 2.96–5.8%. These results validate that explicit multi-level alignment enables contextually coherent caption generation and precise cross-modal reasoning.
Keywords: vision-and-language; transformer; multi-modal pre-training; multi-level multi-scale alignment
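The staged freezing strategy described in the abstract can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the layer names, the choice to unfreeze layers cumulatively across stages, and the text-level-to-visual-scale pairing are assumptions inferred from the abstract's sentence→phrase→word ordering.

```python
# Illustrative sketch of HieVLP's progressive freezing schedule
# (assumed cumulative unfreezing; not the authors' actual code).

TEXT_STAGES = ["sentence", "phrase", "word"]  # assumed unfreezing order

# Assumed pairing of each text level with its visual scale,
# per the abstract's objects/relations/global-context hierarchy.
VISUAL_SCALE = {
    "sentence": "global",
    "phrase": "relation",
    "word": "object",
}


def trainable_layers(stage_index: int) -> list[str]:
    """Text layers active at a given pre-training stage.

    Stage 0 trains only the sentence-level layer; by stage 2
    all three levels are unfrozen (a cumulative-unfreezing assumption).
    """
    return TEXT_STAGES[: stage_index + 1]


def alignment_pairs(stage_index: int) -> list[tuple[str, str]]:
    """(text level, visual scale) pairs whose contrastive and
    language-modeling losses are active at this stage."""
    return [(lvl, VISUAL_SCALE[lvl]) for lvl in trainable_layers(stage_index)]
```

For example, `alignment_pairs(1)` returns `[("sentence", "global"), ("phrase", "relation")]`: at the second stage, sentence-level features align with global context while phrase-level features align with mid-scale relations, and the word/object pairing is not yet active.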

Share and Cite

MDPI and ACS Style

Xie, H.; Qin, Y.; Ding, S. Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics 2025, 14, 816. https://doi.org/10.3390/electronics14040816

AMA Style

Xie H, Qin Y, Ding S. Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics. 2025; 14(4):816. https://doi.org/10.3390/electronics14040816

Chicago/Turabian Style

Xie, Huiming, Yang Qin, and Shuxue Ding. 2025. "Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment" Electronics 14, no. 4: 816. https://doi.org/10.3390/electronics14040816

APA Style

Xie, H., Qin, Y., & Ding, S. (2025). Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics, 14(4), 816. https://doi.org/10.3390/electronics14040816

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
