Article

Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment

1 Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin 541004, China
2 School of Artificial Intelligence, Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 816; https://doi.org/10.3390/electronics14040816
Submission received: 7 January 2025 / Revised: 12 February 2025 / Accepted: 17 February 2025 / Published: 19 February 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Vision–language pre-training (VLP) faces challenges in aligning hierarchical textual semantics (words/phrases/sentences) with multi-scale visual features (objects/relations/global context). We propose a hierarchical VLP model (HieVLP) that addresses these challenges through semantic decomposition and progressive alignment. Textually, a semantic parser deconstructs captions into word-, phrase-, and sentence-level components, which are encoded via hierarchical BERT layers. Visually, a Swin Transformer extracts object-level (local), relation-level (mid-scale), and global-level features through shifted window hierarchies. During pre-training, a freezing strategy sequentially activates text layers (sentence→phrase→word), aligning each with the corresponding visual scale via contrastive and language modeling losses. Experimental evaluations demonstrate that HieVLP outperforms hierarchical baselines across a range of tasks, with performance improvements from approximately 3.2% to 11.2%. In image captioning, HieVLP achieves an average CIDEr improvement of around 7.2% and a 2.1% improvement in SPICE. For image–text retrieval, it achieves recall increases of 4.7–6.8%. In reasoning tasks, HieVLP boosts accuracy by 2.96–5.8%. These results validate that explicit multi-level alignment enables contextually coherent caption generation and precise cross-modal reasoning.
Keywords: vision-and-language; transformer; multi-modal pre-training; multi-level multi-scale alignment
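The staged freezing strategy described in the abstract can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the layer names, the choice to unfreeze layers cumulatively across stages, and the text-level-to-visual-scale pairing are assumptions inferred from the abstract's sentence→phrase→word ordering.

```python
# Illustrative sketch of HieVLP's progressive freezing schedule
# (assumed cumulative unfreezing; not the authors' actual code).

TEXT_STAGES = ["sentence", "phrase", "word"]  # assumed unfreezing order

# Assumed pairing of each text level with its visual scale,
# per the abstract's objects/relations/global-context hierarchy.
VISUAL_SCALE = {
    "sentence": "global",
    "phrase": "relation",
    "word": "object",
}


def trainable_layers(stage_index: int) -> list[str]:
    """Text layers active at a given pre-training stage.

    Stage 0 trains only the sentence-level layer; by stage 2
    all three levels are unfrozen (a cumulative-unfreezing assumption).
    """
    return TEXT_STAGES[: stage_index + 1]


def alignment_pairs(stage_index: int) -> list[tuple[str, str]]:
    """(text level, visual scale) pairs whose contrastive and
    language-modeling losses are active at this stage."""
    return [(lvl, VISUAL_SCALE[lvl]) for lvl in trainable_layers(stage_index)]
```

For example, `alignment_pairs(1)` returns `[("sentence", "global"), ("phrase", "relation")]`: at the second stage, sentence-level features align with global context while phrase-level features align with mid-scale relations, and the word/object pairing is not yet active.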

Share and Cite

MDPI and ACS Style

Xie, H.; Qin, Y.; Ding, S. Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics 2025, 14, 816. https://doi.org/10.3390/electronics14040816

AMA Style

Xie H, Qin Y, Ding S. Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics. 2025; 14(4):816. https://doi.org/10.3390/electronics14040816

Chicago/Turabian Style

Xie, Huiming, Yang Qin, and Shuxue Ding. 2025. "Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment" Electronics 14, no. 4: 816. https://doi.org/10.3390/electronics14040816

APA Style

Xie, H., Qin, Y., & Ding, S. (2025). Hierarchical Vision–Language Pre-Training with Freezing Strategy for Multi-Level Semantic Alignment. Electronics, 14(4), 816. https://doi.org/10.3390/electronics14040816

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
