- Article
Structured Scene Parsing with a Hierarchical CLIP Model for Images
- Yunhao Sun, Xiaoao Chen, Heng Chen, et al.
Visual Relationship Prediction (VRP) is crucial for advancing structured scene understanding, yet existing methods struggle with ineffective multimodal fusion, static relationship representations, and a lack of logical consistency. To address these limitations, this paper proposes a Hierarchical CLIP model (H-CLIP) for structured scene parsing. Our approach leverages a pre-trained CLIP backbone to extract aligned visual, textual, and spatial features for entities and their union regions. A multi-head self-attention mechanism then performs deep, dynamic multimodal fusion. The core innovation is a consistency and reversibility verification mechanism, which imposes algebraic constraints as a regularization loss to enforce logical coherence in the learned relation space. Extensive experiments on the Visual Genome dataset demonstrate the superiority of the proposed method. H-CLIP significantly outperforms state-of-the-art baselines on the predicate classification task, achieving a Recall@50 score of 64.31% and a Mean Recall@50 of 36.02%, thereby validating its effectiveness in generating accurate and logically consistent scene graphs even under long-tailed distributions.
Appl. Sci., 12 January 2026
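
The abstract names two computational pieces, multi-head self-attention fusion over CLIP-derived visual, textual, and spatial features, and an algebraic consistency/reversibility regularizer, but gives no implementation details. The following is a minimal PyTorch sketch of one plausible reading of that design. `RelationFusion`, `reversibility_loss`, the 512-dimensional embeddings, and the specific form of the constraints (a learned reversal map whose double application should be the identity) are illustrative assumptions, not the authors' actual method.

```python
import torch
import torch.nn as nn


class RelationFusion(nn.Module):
    """Hypothetical H-CLIP-style relation module: CLIP-derived visual,
    textual, and spatial embeddings for an entity pair are fused with
    multi-head self-attention, then classified into predicates."""

    def __init__(self, dim: int = 512, heads: int = 8, num_predicates: int = 50):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_predicates)
        # Learned linear map standing in for predicate "reversal"
        # (e.g. left-of vs. right-of); applying it twice should be a no-op.
        self.reverse = nn.Linear(dim, dim, bias=False)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor, spa: torch.Tensor):
        # Treat the three modality embeddings as a 3-token sequence per pair.
        tokens = torch.stack([vis, txt, spa], dim=1)   # (B, 3, D)
        fused, _ = self.attn(tokens, tokens, tokens)   # dynamic multimodal fusion
        rel = fused.mean(dim=1)                        # pooled relation embedding
        return rel, self.classifier(rel)


def reversibility_loss(model: RelationFusion,
                       rel_fwd: torch.Tensor,
                       rel_bwd: torch.Tensor) -> torch.Tensor:
    """Algebraic regularizer (one plausible reading of the paper's constraint):
    reversing the subject-to-object relation should recover the object-to-subject
    relation, and reversing twice should return the original embedding."""
    once = model.reverse(rel_fwd)
    twice = model.reverse(once)
    return ((once - rel_bwd) ** 2).mean() + ((twice - rel_fwd) ** 2).mean()


if __name__ == "__main__":
    # Random tensors stand in for real CLIP features of a subject-object pair
    # and its swapped (object-subject) counterpart.
    B, D = 4, 512
    model = RelationFusion(dim=D)
    rel_fwd, logits = model(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    rel_bwd, _ = model(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    loss = reversibility_loss(model, rel_fwd, rel_bwd)
    print(loss.item())
```

In this sketch the regularizer is simply added to the predicate classification loss during training, which matches the abstract's description of the constraints acting as a regularization term over the learned relation space.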