Article

Fine-Grained Local and Global Semantic Fusion for Multimodal Image–Text Retrieval

College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(3), 53; https://doi.org/10.3390/bdcc9030053
Submission received: 30 December 2024 / Revised: 20 February 2025 / Accepted: 21 February 2025 / Published: 25 February 2025

Abstract

An image–text retrieval method that integrates intramodal fine-grained local semantic information with intermodal global semantic information is proposed to address the weak fine-grained discrimination of semantic features between the image and text modalities in cross-modal retrieval tasks. First, the original features of images and texts are extracted, and a graph attention network is employed for region relationship reasoning to obtain relation-enhanced local features. Then, an attention mechanism is applied to semantically interacting samples within the same modality, enabling comprehensive intramodal relationship learning and producing semantically enhanced image and text embeddings. Finally, the entire model is trained with a triplet loss function augmented with an angular constraint. Extensive comparative experiments on the Flickr30K and MS-COCO benchmark datasets verified the effectiveness and superiority of the proposed method: it outperformed the current method by a relative 6.4% for image retrieval and 1.3% for caption retrieval on MS-COCO (Recall@1, 1K test set).

1. Introduction

Image–text retrieval is a fundamental task at the intersection of computer vision and natural language processing (NLP). It aims to enable users to find corresponding images on the basis of textual queries or obtain appropriate textual descriptions from images through visual and linguistic understanding. The goal is to achieve cross-modal semantic alignment between images and texts [1], which is crucial for the deep integration of vision and language. In recent years, significant progress has been made in this field [2,3,4,5,6]. However, owing to the substantial semantic discrepancies between images and texts, the existing methods still fail to fully exploit the semantic information contained in both modalities; thus, further optimization is required.
Image–text retrieval methods typically consist of the following key modules [7]: modality embedding, modality interaction, and similarity computation modules. Modality embedding is used to extract visual and textual features. Modality interaction, which can be categorized into intramodal interaction [8,9] and cross-modal interaction methods [10,11,12], is responsible for learning similar features, establishing associations and alignments between images and texts, and uncovering semantic relationships across modalities. Similarity computation is used to quantify the semantic relevance between data derived from different modalities according to specific criteria, facilitating effective modality alignment. Generally, similarity computation relies on triplet loss [13] functions to differentiate between semantically similar and dissimilar samples. However, these methods struggle with fine-grained discrimination tasks and handling hard negatives. As shown in Figure 1, in the case of the “surfing” theme, subtle differences are present between actions such as “holding a surfboard” and “squatting on the surfboard” or “standing on the surfboard”, which are difficult to distinguish. The existing methods may fail to correctly identify such hard negatives, which impacts the alignment between different modalities. Moreover, the triplet loss suffers from the challenge of selecting appropriate margin parameters and its sensitivity to scale changes, with improper margin choices often leading to poor embedding learning results [14,15].
To address these issues and better capture the information contained in hard negatives, this paper proposes an exploring intramodal sample information network (EISIN) to further distinguish hard negatives and resolve the scale sensitivity issue in similarity measurement scenarios. Specifically, the EISIN explores the connection and similarity relationships between different samples within the same modality and incorporates them into a multihead attention mechanism as masks and weights, respectively, thereby fully utilizing the semantics within each modality. To further refine the handling of hard negatives, we combined the angular loss (AL) with the triplet ranking loss, introducing angular constraints that guide negative samples away from the positive samples in the embedding space. This provides an additional optimization direction for the model, ultimately improving its multimodal alignment performance.
To sum up, our main contributions are the following: (a) We propose an image–text matching model that integrates fine-grained local and global semantic fusion. By learning the subtle differences between samples of the same modality, the model effectively distinguishes difficult samples with semantic ambiguity, resulting in better overall embeddings. (b) We constructed a new loss function by combining the triplet ranking loss with the angular loss, optimizing the gradient direction of negative samples to effectively prevent them from approaching positive samples, thereby achieving more precise alignment. (c) Extensive experiments on the Flickr30K and MS-COCO datasets demonstrated that the EISIN model outperforms existing methods.

2. Related Work

2.1. Global Alignment

The existing image–text retrieval methods typically learn a common embedding space in which features from different modalities can be compared for similarity calculation purposes, enabling global alignment by mapping images and texts to the same embedding space through two independent networks. Early works [16,17] often used convolutional neural networks (CNNs) to learn the global features of images, while recurrent neural networks (RNNs) or gated recurrent units (GRUs) were employed to extract textual features, and simple distance metrics were used to measure the similarity between the features of the two modalities. Subsequent research improved the feature learning process for both images and texts. For example, the VSRN [2] first uses graph convolutional networks (GCNs) to learn the fine-grained relationships between region-level features in images and then constructs global features via a GRU or employs pooling methods to aggregate the global features [12]. With the application of attention mechanisms in NLP, most of the current methods use bidirectional encoder representations from transformers (BERT) [18] to extract text features and learn fine-grained relationships between image and text features to enhance their image–text retrieval performance.
In terms of similarity measurement, many works have adopted the triplet ranking loss and bidirectional triplet ranking loss to constrain the model learning process. In addition, VSE++ [13] employs batch hard negative mining to achieve improved computational efficiency and simplify the process. However, these methods require predefined margin information [14], and improper margin settings often result in suboptimal visual semantic embedding learning outcomes. To address this issue, adaptive-margin triplet loss functions [19] introduce angular constraints, avoiding reliance on margin information and further improving the robustness of the modality feature alignment procedure.

2.2. Local Alignment

Unlike global alignment methods, local alignment approaches focus on fine-grained matching between cross-modal entities, typically achieving this by aligning regional features within images with words or phrases in text. Such methods are capable of capturing more nuanced semantic relationships, thereby enhancing the accuracy and robustness of retrieval tasks.
Early local alignment techniques, such as SCAN [20], employed stacked cross-attention to measure the relevance of each fragment in one modality to all fragments in the other, aligning local regions in images with corresponding words in text to achieve fine-grained matching. Building upon this foundation, IMRAM [21] introduces memory distillation units and iteratively extracts cross-modal information, further improving the matching precision. Subsequently, numerous studies [22,23] integrated graph convolutional networks (GCNs) into fine-grained alignment processes, utilizing GCNs to infer feature representations of both images and text to optimize the model. Notably, NAAF [23] enhances discriminative capability and robustness by explicitly mining mismatched segments. With the advancement of attention mechanisms [24], local alignment methods have seen further enhancements. For instance, DAN [25] implements a bidirectional attention mechanism to achieve reciprocal alignment between image regions and text words, thereby strengthening the model’s capacity to capture fine-grained semantic relationships. Through detailed feature matching, local alignment methods address the limitations of global alignment approaches in capturing complex semantic relationships.

3. The Proposed Method

First, features are extracted from both images and texts; this is followed by region relationship reasoning, which is implemented using a graph attention network (GAT). Next, the relationships between different samples within the same modality are incorporated into a relational graph, and an attention mechanism is applied for semantic interaction purposes. This enables the information across different modality samples to be comprehensively learned, resulting in semantically enriched image and text embeddings. Finally, the triplet ranking loss is combined with the AL to optimize the gradient direction of the negative samples, achieving better alignment and enhancing the robustness of the model under significant data variations. Figure 2 shows the structure of the proposed method.

3.1. Feature Extraction

Image region features are extracted via a faster region-based CNN (Faster R-CNN) model [26] with ResNet-101 [27] as its backbone. Following the bottom-up attention approach [28], features are extracted from the $m$ salient regions of an image, and each region is then mapped to a $d$-dimensional local feature via a fully connected (FC) layer. Specifically, given an image, its features can be represented as $\tilde{V} = \{\tilde{v}_1, \tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_m\}$, $\tilde{v}_i \in \mathbb{R}^d$, where $\tilde{v}_i$ encodes the feature of a salient region in the image.
Text features are primarily extracted via the BERT model [18] from the NLP domain, with an additional FC layer included to ensure that the text features have the same dimensionality as the image features. Specifically, for a text, $T$, consisting of $n$ words, the associated word features can be represented as $\tilde{C} = \{\tilde{c}_1, \tilde{c}_2, \tilde{c}_3, \ldots, \tilde{c}_n\}$, $\tilde{c}_i \in \mathbb{R}^d$.
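As a concrete illustration of this embedding step, the following minimal PyTorch sketch projects pre-extracted bottom-up region features and BERT word features into the shared $d$-dimensional space. It is written under stated assumptions (a Hugging Face bert-base-uncased encoder and 2048-dimensional region features); the names img_fc, txt_fc, encode_regions, and encode_words are illustrative, not the authors' code.

```python
# Minimal embedding sketch, assuming pre-extracted Faster R-CNN region features (2048-d)
# and a Hugging Face BERT encoder; module and function names are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

d = 1024  # joint embedding dimensionality used in the paper

# Image side: project (batch, m, 2048) region features to (batch, m, d) local features.
img_fc = nn.Linear(2048, d)

# Text side: BERT word features followed by an FC layer matching the image dimension.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
txt_fc = nn.Linear(bert.config.hidden_size, d)

def encode_regions(region_feats: torch.Tensor) -> torch.Tensor:
    """region_feats: (batch, m, 2048) -> (batch, m, d) local image features."""
    return img_fc(region_feats)

def encode_words(captions: list) -> torch.Tensor:
    """captions: list of strings -> (batch, n, d) word features."""
    tokens = tokenizer(captions, padding=True, return_tensors="pt")
    hidden = bert(**tokens).last_hidden_state  # (batch, n, hidden_size)
    return txt_fc(hidden)
```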

3.2. Region Relationship Reasoning

To facilitate the acquisition of contextual information between image and text features, distinct visual semantic reasoning and textual semantic reasoning models were formulated. These models were designed to effectively capture the intricate semantic relationships between images and texts, thereby enhancing the accuracy and robustness of cross-modal matching.
Considering the semantic correlations between image regions, a region relationship graph is first constructed. Then, a semantic interaction model is used to learn contextual information. Specifically, the pairwise similarity between different image regions is measured via Equation (1) to construct a relationship matrix:
$R(\tilde{v}_i, \tilde{v}_j) = \omega(\tilde{v}_i)^{T}\,\mu(\tilde{v}_j)$    (1)
where $\omega(\tilde{v}_i) = W_\omega \tilde{v}_i$ and $\mu(\tilde{v}_j) = W_\mu \tilde{v}_j$ are two embedded features, and the weight parameters $W_\omega$ and $W_\mu$ are learned through backpropagation. The region features $\tilde{V} = \{\tilde{v}_1, \tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_m\}$ are subsequently used as the nodes, $V$, of the graph, whereas the relationship matrix $R$ serves as the edges, $E$, for constructing the relationship graph $G_I = (V, E)$. A GAT [10], specifically a self-attention layer [11], is then applied to this fully connected graph to perform reasoning, capture semantic relationships, and learn the relation-enhanced local features $V^* = \{v_1^*, v_2^*, v_3^*, \ldots, v_m^*\}$, $v_i^* \in \mathbb{R}^d$. Finally, the original region features and the enhanced features are aggregated via a combination of maximum pooling and average pooling, resulting in a global visual embedding, $v \in \mathbb{R}^d$, as shown in Equation (2):
$v = \eta \cdot \mathrm{MaxPool}(\tilde{V}) + (1 - \eta) \cdot \mathrm{AvgPool}(V^*)$    (2)
where η is used to control the ratio between the two types of representations.
Akin to visual semantic reasoning, a semantic relation graph is constructed between the words in the given text. In the fully connected graph $G_T = (V, E)$, the nodes, $V$, represent the word features $\tilde{C} = \{\tilde{c}_1, \tilde{c}_2, \tilde{c}_3, \ldots, \tilde{c}_n\}$, and the edges denote the semantic relationships between words. A self-attention layer is also applied to obtain the relation-enhanced word features $C^* = \{c_1^*, c_2^*, c_3^*, \ldots, c_n^*\}$. Finally, maximum pooling and average pooling are used to aggregate these word features, resulting in a global textual embedding, $u \in \mathbb{R}^d$, as shown in Equation (3):
$u = \eta \cdot \mathrm{MaxPool}(\tilde{C}) + (1 - \eta) \cdot \mathrm{AvgPool}(C^*)$    (3)
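For illustration, the reasoning and pooling steps of Equations (1) and (2) can be sketched as follows, assuming that a single self-attention layer plays the role of the GAT; the layer names W_omega, W_mu, and W_phi are illustrative rather than the authors' identifiers.

```python
# Sketch of region relationship reasoning (Eqs. (1)-(2)); the textual branch (Eq. (3))
# is identical, applied to the n word features of a caption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionReasoning(nn.Module):
    def __init__(self, d: int = 1024, eta: float = 0.8):
        super().__init__()
        self.W_omega = nn.Linear(d, d, bias=False)  # omega(.) in Eq. (1)
        self.W_mu = nn.Linear(d, d, bias=False)     # mu(.) in Eq. (1)
        self.W_phi = nn.Linear(d, d, bias=False)    # value projection for message passing
        self.eta = eta                              # pooling ratio in Eq. (2)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        """regions: (batch, m, d) local features -> (batch, d) global visual embedding."""
        # Pairwise relation matrix R (Eq. (1)), normalized row-wise into attention weights.
        R = self.W_omega(regions) @ self.W_mu(regions).transpose(1, 2)  # (batch, m, m)
        attn = F.softmax(R, dim=-1)
        enhanced = attn @ self.W_phi(regions)       # relation-enhanced features V*

        # Eq. (2): fuse max-pooled original features with average-pooled enhanced features.
        return self.eta * regions.max(dim=1).values + (1.0 - self.eta) * enhanced.mean(dim=1)
```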

3.3. Semantic Relationship Enhancement

To incorporate the semantic information derived from samples within the same modality, a relationship matrix and a weight matrix are first constructed. These two matrices are then combined with a relation interaction mechanism to capture subtle semantic information, ultimately producing semantically enhanced image–text embeddings. During the process of learning semantic information within the same modality, the fully connected graph causes the embeddings derived from similar images to become increasingly similar, whereas the embeddings acquired from dissimilar images become progressively more distinct as they propagate through multiple layers of the neural network. This can lead to oversmoothing and noisy associations [29]. Therefore, it is necessary to reconstruct a new relationship matrix to facilitate the learning of semantic information within each modality. When two embedding nodes are close to each other, their semantic information typically overlaps, indicating a possible semantic connection [30]. Thus, given $N$ image–text pairs, $\{v_i, c_i\}_{i=1}^{N}$, the similarity between all matching samples is computed. The top $\theta\%$ of the feature vectors are then selected, assuming that they share a semantic connection:
$(A^{II})_{ij} = \begin{cases} 1, & S(v_i, v_j) \ge \mathrm{quantile}(S, \theta) \\ 0, & S(v_i, v_j) < \mathrm{quantile}(S, \theta) \end{cases}$    (4)
$(A^{TT})_{ij} = \begin{cases} 1, & S(c_i, c_j) \ge \mathrm{quantile}(S, \theta) \\ 0, & S(c_i, c_j) < \mathrm{quantile}(S, \theta) \end{cases}$    (5)
where $S(\cdot)$ represents the similarity function in the joint embedding space (the conventional inner product is used in the experiments), and $\mathrm{quantile}(S, \theta)$ denotes the threshold corresponding to the top $\theta\%$ of the similarity values.
Given the $N$ image–text pairs, $\{v_i, c_i\}_{i=1}^{N}$, the relevance weight matrix for samples within the same modality is computed from the global embeddings:
$(X^{II})_{ij} = e^{-\frac{\| v_i - v_j \|^2}{2\sigma}}, \qquad (X^{TT})_{ij} = e^{-\frac{\| c_i - c_j \|^2}{2\sigma}}$    (6)
where σ is a positive scalar that controls the relevance value (for simplicity, σ = 1 ).
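A minimal sketch of Equations (4)–(6) for a single modality is given below, using the choices stated above (inner-product similarity, top θ% thresholding, σ = 1); the function name build_relation_matrices is illustrative.

```python
# Sketch of the intramodal connection matrix A (Eqs. (4)-(5)) and relevance weight
# matrix X (Eq. (6)) for one modality, assuming inner-product similarity.
import torch

def build_relation_matrices(emb: torch.Tensor, theta: float = 0.5, sigma: float = 1.0):
    """emb: (N, d) global embeddings of N samples from one modality -> (A, X)."""
    # Binary connection matrix A: 1 where the similarity falls in the top theta fraction.
    S = emb @ emb.t()                                    # (N, N) inner-product similarities
    threshold = torch.quantile(S.flatten(), 1.0 - theta)
    A = (S >= threshold).float()

    # Relevance weight matrix X: Gaussian kernel on squared pairwise embedding distances.
    dist_sq = torch.cdist(emb, emb, p=2).pow(2)          # (N, N)
    X = torch.exp(-dist_sq / (2.0 * sigma))
    return A, X
```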
After the embedding relationship matrix and weight matrix are obtained, a relation interaction mechanism is employed to capture the observed semantic relationships. The visual and textual embeddings are separately input into the multihead attention module to achieve cross-modal relation interaction, where the queries (Q) and key–value pairs ( K V ) come from two different modalities. The connection matrix A is used as the attention mask matrix for the attention module, whereas the association matrix X serves as an additional attention weight matrix for explicit relation modelling [11], where λ is used to balance X with the original attention weight matrix. Therefore, the basic attention formula is modified as follows:
$\mathrm{Att}(Q, K, V; A, X) = \mathrm{softmax}_{\mathrm{mask}(A)}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + \lambda X\right)V$    (7)
After the relation interaction phase, the relation-enhanced features of the two modalities, $\{\bar{v}_1, \ldots, \bar{v}_N\}$ and $\{\bar{c}_1, \ldots, \bar{c}_N\}$, are obtained.
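A single-head version of the modified attention in Equation (7) can be sketched as follows; it is illustrative only and assumes that the diagonal of A is 1 (every sample is at least connected to itself), so each row has a valid softmax.

```python
# Sketch of Eq. (7): the connection matrix A acts as an attention mask and the relevance
# matrix X as an additive weight scaled by lambda (single head, no output projection).
import torch
import torch.nn.functional as F

def relation_attention(Q, K, V, A, X, lam: float = 1.5):
    """Q, K, V: (N, d); A, X: (N, N); returns relation-enhanced embeddings of shape (N, d)."""
    d_k = Q.size(-1)
    scores = Q @ K.t() / d_k ** 0.5 + lam * X           # scaled dot product plus explicit weights
    scores = scores.masked_fill(A == 0, float("-inf"))  # keep only connected sample pairs
    return F.softmax(scores, dim=-1) @ V
```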

3.4. Loss Function

The AL and triplet ranking loss are combined to address the issue of large distributional differences between features belonging to different classes within the same modality in the embedding space, as well as their sensitivity to scale variations.
For the relationship-enhanced image and text features $\{\bar{v}_1, \ldots, \bar{v}_N\}$ and $\{\bar{c}_1, \ldots, \bar{c}_N\}$, the triplet ranking loss based on hard negative mining [13] is used for matching. The loss is defined as shown in Equation (8):
$L_{triplet}(v, c) = \left[\alpha - S(v, c) + S(v, \hat{c})\right]_{+} + \left[\alpha - S(v, c) + S(\hat{v}, c)\right]_{+}$    (8)
where $\alpha$ is the margin parameter and $[x]_{+} = \max(x, 0)$. $S(\cdot)$ is the similarity function in the joint embedding space. $(v, c)$ represents a pair of positive samples, whereas $(v, \hat{c})$ and $(\hat{v}, c)$ represent negative sample pairs. The hard negative samples are given by $\hat{v} = \arg\max_{j \neq v} S(j, c)$ and $\hat{c} = \arg\max_{i \neq c} S(v, i)$.
As shown in Figure 3a, the gradient directions in the triplet ranking loss may cause the distance between the negative and positive samples to decrease. To address this issue, the AL [8,15] is introduced, as shown in Figure 3b, where angular constraints are constructed on the basis of the distances between the anchor point and the positive/negative samples, forming a triangle. This provides an additional constraint source while ensuring that the gradient direction of the negative samples moves away from the positive samples and the anchor point. This prevents the issue encountered when using the triplet ranking loss where pushing negative samples away from the anchor point inadvertently brings them closer to the positive samples, thus achieving better alignment effects:
$L_{angular}(v, c) = \log\left[1 + \exp\left(f(e_v, e_c, e_{\hat{c}})\right)\right] + \log\left[1 + \exp\left(f(e_c, e_v, e_{\hat{v}})\right)\right]$    (9)
where $\alpha$ is the parameter that constrains the angle in the AL and $f(a, p, n) = 4\tan^{2}\alpha \, (a + p)\,n^{T} - 2(1 + \tan^{2}\alpha)\, a\, p^{T}$, in which $a$, $p$, and $n$ denote the anchor, positive, and negative embeddings, respectively. $\hat{v} = \arg\max_{\tau \neq v} f(e_c, e_v, e_\tau)$ and $\hat{c} = \arg\max_{\psi \neq c} f(e_v, e_c, e_\psi)$ are the hard negatives in the AL.
Finally, the triplet ranking loss and AL are combined to obtain the final loss function:
$L(v, c) = L_{triplet}(v, c) + \omega L_{angular}(v, c)$    (10)
where ω is used to control the importance of the AL.
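For illustration, the combined objective of Equations (8)–(10) can be sketched as follows, assuming inner-product similarities and batchwise hard negative mining; the angle value alpha_deg is a placeholder, as the text does not tie the angular parameter to a specific degree.

```python
# Sketch of the combined loss (Eqs. (8)-(10)): hard-negative triplet ranking loss plus
# the angular loss, weighted by omega. Hyperparameter names follow the text.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim: (N, N) with sim[i, j] = S(v_i, c_j); matched pairs lie on the diagonal."""
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    cost_c = (margin - pos + sim).clamp(min=0).masked_fill(mask, 0)      # negative captions per image
    cost_v = (margin - pos.t() + sim).clamp(min=0).masked_fill(mask, 0)  # negative images per caption
    return cost_c.max(dim=1).values.sum() + cost_v.max(dim=0).values.sum()

def angular_loss(v: torch.Tensor, c: torch.Tensor, alpha_deg: float = 45.0) -> torch.Tensor:
    """v, c: (N, d) matched image/text embeddings; matched pairs act as anchor/positive."""
    tan_sq = torch.tan(torch.deg2rad(torch.tensor(alpha_deg))) ** 2

    def f(a, p, neg):  # f(a, p, n) as defined in the text
        return 4 * tan_sq * (a + p) @ neg.t() - 2 * (1 + tan_sq) * (a * p).sum(-1, keepdim=True)

    mask = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
    f_vc = f(v, c, c).masked_fill(mask, float("-inf")).max(dim=1).values  # hardest negative caption
    f_cv = f(c, v, v).masked_fill(mask, float("-inf")).max(dim=1).values  # hardest negative image
    return (F.softplus(f_vc) + F.softplus(f_cv)).sum()                    # softplus(x) = log(1 + exp(x))

def total_loss(v: torch.Tensor, c: torch.Tensor, omega: float = 0.65) -> torch.Tensor:
    return triplet_ranking_loss(v @ c.t()) + omega * angular_loss(v, c)
```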

4. Experimental Results and Analysis

4.1. Datasets and Evaluation Metrics

To evaluate the effectiveness of the proposed method, two public datasets, MS-COCO [31] and Flickr30K [32], were used. MS-COCO contains 123,287 images, each with five corresponding textual descriptions. Based on [13,33], 113,287 images were used for training, 5000 images were used for validation, and the remaining 5000 images were used for testing. The MS-COCO test results are presented as the averages of fivefold cross-validation results produced using 1000 test samples (COCO 5-fold 1K test) and all 5000 test samples (COCO 5K test). Flickr30K contains 31,783 images, each with five textual descriptions. On the basis of the split described in [13], 1014 images were used for validation, 1000 images were used for testing, and 29,000 images were used for training. Text–image retrieval methods are typically evaluated via the Recall@K (K = 1, 5, 10) metrics, which are denoted as R@1, R@5, and R@10. The Recall@K represents the percentage of relevant items found within the top K retrieved items, with higher values indicating better retrieval accuracy. The total scores were calculated for three image retrieval metrics and three text retrieval metrics and combined into an overall evaluation metric for text–image retrieval, which is referred to as the rSum:
$\mathrm{rSum} = \left(R@1 + R@5 + R@10\right)_{image} + \left(R@1 + R@5 + R@10\right)_{text}$    (11)
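For reference, Recall@K and the rSum can be computed from a similarity matrix as in the sketch below, which assumes the simplified single-ground-truth protocol (image i is paired with caption i); the official MS-COCO and Flickr30K splits have five captions per image, which the released evaluation scripts account for.

```python
# Illustrative Recall@K and rSum computation from an image-caption similarity matrix.
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)):
    """sim: (num_images, num_captions) similarities; ground truth lies on the diagonal."""
    ranks_txt = (-sim).argsort(axis=1).argsort(axis=1).diagonal()  # rank of the true caption per image
    ranks_img = (-sim).argsort(axis=0).argsort(axis=0).diagonal()  # rank of the true image per caption
    text_r = [100.0 * np.mean(ranks_txt < k) for k in ks]          # caption (text) retrieval R@K
    image_r = [100.0 * np.mean(ranks_img < k) for k in ks]         # image retrieval R@K
    return text_r, image_r, sum(text_r) + sum(image_r)             # the last value is the rSum
```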
In terms of the experimental setup, all the experiments were implemented via the PyTorch (version 1.13.1) framework on an NVIDIA Tesla T4 GPU. During training, the adaptive moment estimation (Adam) optimizer was used with an initial learning rate of $1 \times 10^{-5}$, which decayed by a factor of 0.1 every 10 epochs. The batch sizes for Flickr30K and MS-COCO were set to 128 and 256, respectively. Pre-extracted image region features [26] were used, and both the image and text features were transformed into $d$-dimensional vectors, with $d = 1024$. The hyperparameter $\eta = 0.8$ controlled the pooling ratio for both modalities. During semantic relationship enhancement learning (SEL), the quantile $\theta$ was set to 50%, and $\lambda$ was set to 1.5. The hyperparameters in the loss function were $\alpha = 0.2$, $\beta = 0.5$, and $\omega = 0.65$.
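A minimal sketch of this optimization setup is shown below (Adam, initial learning rate of $1 \times 10^{-5}$, decay by a factor of 0.1 every 10 epochs, batch size 128); the model and mini-batch are placeholders for the full EISIN pipeline.

```python
# Training-setup sketch; only the optimizer/scheduler wiring reflects the described setup.
import torch

model = torch.nn.Linear(1024, 1024)                        # placeholder for the full EISIN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    dummy_batch = torch.randn(128, 1024)                   # stand-in mini-batch
    loss = model(dummy_batch).pow(2).mean()                # stand-in for the combined loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                       # decay the learning rate every 10 epochs
```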

4.2. Comparative Experiments

To validate the superiority of the proposed method, the model was compared with recently published advanced models. Owing to experimental environment limitations, large pretrained models were not included in the comparison. The experimental results were obtained either by running the source code provided in the original papers or by citing the experimental results from the original works for comparison purposes. The best results obtained in terms of the R@K and rSum are highlighted in bold.
Table 1 presents a comparison between the proposed method and the recently published advanced models on the MS-COCO dataset. Table 2 shows a comparison between the proposed method and the recently published advanced models on the Flickr30K dataset. On the COCO 5K test set, the EISIN model achieved the highest R@K performance across the three metrics. It outperformed HREM [3] and VSRN++ [34] by 1% and 1.5% in terms of the rSum, respectively. Although the image retrieval performance of the EISIN model was comparable to that of the previously developed methods, it exhibited a significant improvement in the text retrieval task, outperforming HREM [3] and VSRN++ [34] by approximately 2.3% and 2.7%, respectively. The EISIN model also demonstrated a substantial performance gain on the COCO 5-fold 1K test set.
On the Flickr30K test set, the EISIN model achieved the highest R@K performance across the four metrics, with an rSum improvement of approximately 2.3% over that of VSRN++ [34].
In addition to the accuracy of caption or image retrieval, we also argue that efficiency during the inference stage is crucial when evaluating a model’s performance. This is especially important when the model significantly boosts the retrieval performance but its inference stage is time-consuming. Therefore, we present a comparison of the proposed method with recent works in terms of the inference time and model parameters in Table 3 and Table 4, respectively. Table 3 presents the time consumption for calculating embeddings (“Encoding”) and obtaining similarity scores for all image–text pairs (“Matching”). A comparison with recent methods was conducted on the Flickr30K and MS-COCO test sets. The data loading time and feature extraction time were excluded, as these were consistent across the methods. All methods were tested on the same machine, and the proposed approach demonstrated a significant advantage in the matching efficiency, which is crucial for retrieval tasks. Table 4 provides detailed information on image–text retrieval, including the number of epochs trained, batch size, learning rate, and number of parameters. The data in the “LR” column represent the initial learning rate, the epoch when the learning rate changed, and the change rate.
Additionally, to visually demonstrate the superiority of the EISIN framework, Figure 4 compares two image–text retrieval methods, the EISIN and VSRN [2], with green markers indicating correct retrievals and red markers indicating incorrect retrievals. As shown, the EISIN outperforms the VSRN [2] in terms of its image–text retrieval ability.

4.3. Ablation Studies

Ablation studies were conducted on the main components of the proposed method using the MS-COCO 5K dataset to investigate the impacts of these components on the resulting model’s performance. The experimental results obtained with various configurations are shown in Table 5.
(1) The effectiveness of the AL: The performance of the model was evaluated after the AL was removed. As shown in Table 5, removing the AL resulted in an overall decline of approximately 2.2% in the text retrieval metrics and a 1.0% decrease in the rSum. This demonstrates that the AL effectively optimizes the gradient direction, pushing the gradients of the negative samples away from both the positive samples and the anchor point and thus leading to better retrieval performance. The impact of the AL on image retrieval is less significant than that on text retrieval, which may be because the features of different classes are already more widely separated in the image embedding space, leaving the angular constraint less room to improve image retrieval.
(2) The effectiveness of SEL: The performance of the model was also evaluated after removing the SEL module, which prevented the model from learning semantic relationships within the same modality. As shown in Table 5, all six evaluation metrics and the rSum decreased significantly, demonstrating that the SEL module enables the model to learn the semantic information shared by samples within each modality. By utilizing this more comprehensive semantic information, the model achieved better retrieval performance across all evaluation metrics, confirming the effectiveness of the proposed method.

5. Conclusions

An image–text retrieval method is proposed to optimize the global matching process. This approach not only integrates the advantages of global matching but also effectively utilizes the subtle semantic relationships between samples within the same modality to distinguish hard negative samples with semantic ambiguities. By establishing a weighted relationship between different samples within the same modality and incorporating it into an attention mechanism, the developed method captures the semantic information between the samples. Additionally, the AL is introduced to provide an extra constraint source, leading to a better alignment effect. Extensive experiments conducted on two benchmark datasets, Flickr30K and MS-COCO, demonstrated that the proposed method effectively improves the accuracy of image–text retrieval. However, the model relies on region features extracted by object detectors, which introduces the issue of error propagation during training. The accumulation of errors can lead to a decline in the alignment accuracy and even affect the overall performance of the model. Future research will focus on further refining the feature extraction strategies, particularly by leveraging the Vision Transformer (ViT) to directly learn feature representations from raw image data and incorporating the Visual Language–Context Patch to mitigate error propagation and further optimize the performance of the model.

Author Contributions

Z.W.: conceptualization, methodology; S.P.: methodology, experimental studies, writing—original draft; C.Z.: writing—review; J.L.: writing—review and editing; L.J.: investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R and D Program of China (2021YFF0501101), the National Natural Science Foundation of China Youth Fund Project (Grant 62106074), the National Natural Science Foundation of China (52272347), and the National Science Fund of Hunan (2024JJ7132).

Data Availability Statement

The datasets used in this study are available at https://www.kaggle.com/datasets/kuanghueilee/scan-features (accessed on 3 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pan, Z.; Wu, F.; Zhang, B. Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 19275–19284. [Google Scholar]
  2. Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Visual Semantic Reasoning for Image-Text Matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  3. Fu, Z.; Mao, Z.; Song, Y.; Zhang, Y. Learning Semantic Relationship Among Instances for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 15159–15168. [Google Scholar]
  4. Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  5. Albalawi, B.M.; Jamal, A.T.; Al Khuzayem, L.A.; Alsaedi, O.A. An End-to-End Scene Text Recognition for Bilingual Text. Big Data Cogn. Comput. 2024, 8, 117. [Google Scholar] [CrossRef]
  6. Zihao Ni, Z.Z.; Ren, P. Incorporating object counts into remote sensing image captioning. Int. J. Digit. Earth 2024, 17, 2392847. [Google Scholar] [CrossRef]
  7. Rao, J.; Wang, F.; Ding, L.; Qi, S.; Zhan, Y.; Liu, W.; Tao, D. Where Does the Performance Improvement Come From?—A Reproducibility Concern about Image-Text Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval—SIGIR’22, New York, NY, USA, 11–15 July 2022; pp. 2727–2737. [Google Scholar] [CrossRef]
  8. Wu, Y.; Wang, S.; Song, G.; Huang, Q. Learning Fragment Self-Attention Embeddings for Image-Text Matching. In Proceedings of the 27th ACM International Conference on Multimedia—MM’19, New York, NY, USA, 21–25 October 2019; pp. 2088–2096. [Google Scholar] [CrossRef]
  9. Qu, L.; Liu, M.; Cao, D.; Nie, L.; Tian, Q. Context-Aware Multi-View Summarization Network for Image-Text Matching. In Proceedings of the 28th ACM International Conference on Multimedia—MM’20, Seattle, WA, USA, 12–16 October 2020; pp. 1047–1055. [Google Scholar] [CrossRef]
  10. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  12. Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; Wang, C. Learning the Best Pooling Strategy for Visual Semantic Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 15789–15798. [Google Scholar]
  13. Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. arXiv 2018, arXiv:1707.05612. [Google Scholar] [CrossRef]
  14. Liu, M.; Qi, M.; Zhan, Z.; Qu, L.; Nie, X.; Nie, L. A Survey on Deep Learning Based Image-Text Matching. Chin. J. Comput. 2023, 46, 2370–2399. [Google Scholar]
  15. Wang, J.; Zhou, F.; Wen, S.; Liu, X.; Lin, Y. Deep Metric Learning with Angular Loss. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  16. Wang, L.; Li, Y.; Lazebnik, S. Learning Deep Structure-Preserving Image-Text Embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  17. Wang, L.; Li, Y.; Huang, J.; Lazebnik, S. Learning Two-Branch Neural Networks for Image-Text Matching Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 394–407. [Google Scholar] [CrossRef] [PubMed]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  19. Biten, A.F.; Mafla, A.; Gómez, L.; Karatzas, D. Is an Image Worth Five Sentences? A New Look Into Semantics for Image-Text Matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1391–1400. [Google Scholar]
  20. Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked Cross Attention for Image-Text Matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  21. Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; Han, J. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  22. Cheng, Y.; Zhu, X.; Qian, J.; Wen, F.; Liu, P. Cross-modal Graph Matching Network for Image-text Retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–23. [Google Scholar] [CrossRef]
  23. Zhang, K.; Mao, Z.; Wang, Q.; Zhang, Y. Negative-Aware Attention Framework for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15661–15670. [Google Scholar]
  24. Wang, J.H.; Norouzi, M.; Tsai, S.M. Augmenting Multimodal Content Representation with Transformers for Misinformation Detection. Big Data Cogn. Comput. 2024, 8, 134. [Google Scholar] [CrossRef]
  25. Nam, H.; Ha, J.W.; Kim, J. Dual Attention Networks for Multimodal Reasoning and Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  29. Seidenschwarz, J.D.; Elezi, I.; Leal-Taixé, L. Learning Intra-Batch Connections for Deep Metric Learning. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research. Meila, M., Zhang, T., Eds.; Volume 139, pp. 9410–9421. [Google Scholar]
  30. Kaya, M.; Bilge, H.Ş. Deep Metric Learning: A Survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer Nature: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  32. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  33. Messina, N.; Amato, G.; Esuli, A.; Falchi, F.; Gennaro, C.; Marchand-Maillet, S. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–23. [Google Scholar] [CrossRef]
  34. Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Image-Text Embedding Learning via Visual and Textual Semantic Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 641–656. [Google Scholar] [CrossRef] [PubMed]
  35. Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-Aware Attention Network for Image-Text Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  36. Wei, J.; Xu, X.; Yang, Y.; Ji, Y.; Wang, Z.; Shen, H.T. Universal Weighting Metric Learning for Cross-Modal Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  37. Pei, J.; Zhong, K.; Yu, Z.; Wang, L.; Lakshmanna, K. Scene Graph Semantic Inference for Image and Text Matching. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–23. [Google Scholar] [CrossRef]
  38. Zhou, H.; Geng, Y.; Zhao, J.; Ma, X. Semantic-Enhanced Attention Network for Image-Text Matching. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 1256–1261. [Google Scholar] [CrossRef]
  39. Liu, C.; Mao, Z.; Liu, A.-A.; Zhang, T.; Wang, B.; Zhang, Y. Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), Nice, France, 21–25 October 2019; pp. 3–11. [Google Scholar] [CrossRef]
  40. Wang, Z.; Liu, X.; Li, H.; Sheng, L.; Yan, J.; Wang, X.; Shao, J. CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5764–5773. Available online: https://openaccess.thecvf.com/content_ICCV_2019/html/Wang_CAMP_Cross-Modal_Adaptive_Message_Passing_for_Text-Image_Retrieval_ICCV_2019_paper.html (accessed on 3 July 2024).
Figure 1. Similar behaviours under the same theme.
Figure 2. Structure of the EISIN. First, features are extracted from both images and texts (Section 3.1). Then, region relationship reasoning is performed using a graph attention network (GAT) to capture spatial and semantic relationships (Section 3.2). Next, a relational graph is created to incorporate the relationships within each modality, and an attention mechanism is applied for cross-modal semantic interaction (Section 3.3). Finally, the triplet ranking loss, combined with the angular loss (AL), optimizes negative sample gradients, improving the alignment and robustness under data variations (Section 3.4).
Figure 3. Losses and gradient visualization. (a) Triplet loss and gradient illustration; (b) angular loss and gradient illustration.
Figure 4. Examples of the proposed method and the VSRN retrieval method.
Table 1. Comparison of the experimental results on the MS-COCO dataset.
Method            Text Retrieval            Image Retrieval           rSum
                  R@1     R@5     R@10      R@1     R@5     R@10
COCO 5K Test
VSE++ [13]        41.3    71.1    81.0      30.3    59.4    72.4      355.7
SCAN [20]         50.4    82.4    82.9      38.6    69.3    80.4      410.9
VSRN [2]          53.0    81.1    81.1      40.5    70.6    81.1      415.7
IMRAM [21]        53.7    83.7    83.7      39.7    69.1    79.8      416.5
CAAN [35]         52.5    83.0    83.9      41.2    70.3    82.9      421.1
CGMN [22]         53.4    84.1    85.8      41.2    71.9    82.4      419.8
NAAF [23]         58.9    84.1    85.8      42.5    70.9    81.4      430.9
VSRN++ [34]       54.7    87.2    87.2      42.0    72.2    82.7      425.4
HREM [3]          57.7    82.5    89.9      42.6    72.3    82.7      427.7
EISIN             59.6    84.4    91.2      42.0    72.3    82.5      432.1
COCO 5-Fold 1K Test
VSE++ [13]        64.6    90.0    95.7      52.0    84.3    92.0      478.6
SCAN [20]         72.7    94.8    98.4      58.8    88.4    94.8      507.9
VSRN [2]          76.2    94.8    98.2      62.8    89.7    95.1      516.8
IMRAM [21]        76.7    95.6    98.5      61.7    89.1    95.0      516.6
CAMERA [9]        77.5    96.3    98.8      63.4    90.9    95.8      522.7
MPL [36]          71.1    93.7    98.2      56.8    86.7    93.0      499.5
NAAF [23]         78.1    96.1    98.6      63.5    89.6    95.3      521.2
SGSIN [37]        76.7    96.5    99.1      61.7    89.6    95.3      523.6
SEAM [38]         77.9    95.6    98.3      64.2    91.2    96.4      523.6
VSRN++ [34]       77.9    96.0    98.5      64.1    91.0    96.1      523.6
HREM [3]          78.2    95.3    98.2      64.4    90.9    95.9      522.9
EISIN             79.3    95.9    98.5      63.9    90.8    95.8      524.1
Bold indicates the best results in each column.
Table 2. Comparison of the experimental results obtained on the Flickr30K dataset.
Method            Text Retrieval            Image Retrieval           rSum
                  R@1     R@5     R@10      R@1     R@5     R@10
VSE++ [13]        52.9    80.5    87.2      39.6    70.1    79.5      409.8
SCAN [20]         67.4    90.3    95.8      48.6    71.7    85.2      465.0
VSRN [2]          71.3    90.6    96.0      54.7    81.8    88.2      482.6
IMRAM [21]        74.1    93.0    96.6      53.9    79.4    87.2      484.2
CAMERA [9]        78.0    95.1    97.9      60.3    85.9    91.7      508.2
MPL [36]          69.4    89.9    95.4      47.5    75.5    83.1      460.8
NAAF [23]         79.6    96.3    98.3      59.3    83.9    90.2      507.6
SGSIN [37]        73.1    93.6    96.8      53.9    80.1    87.2      484.7
SEAM [38]         79.1    94.2    98.7      61.8    86.5    90.6      510.9
VSRN++ [34]       79.2    94.6    97.5      60.6    85.6    91.4      508.9
HREM [3]          83.3    96.1    98.4      62.2    86.4    91.8      518.2
EISIN             83.3    96.2    98.3      63.6    87.3    92.0      520.7
Bold indicates the best results in each column.
Table 3. Comparisons of the inference time of recent methods whose code is publicly available.
Method            Flickr30K                   MS-COCO
                  Encoding    Matching        Encoding    Matching
SCAN [20]         9.7 s       599.0 s         44.6 s      2746.4 s
BFAN [39]         12.9 s      1158.4 s        58.7 s      5744.2 s
CAMP [40]         4.3 s       1291.5 s        19.9 s      6523.9 s
MPL [36]          10.2 s      648.7 s         46.3 s      3021.0 s
IMRAM [21]        9.8 s       680.5 s         47.7 s      3417.4 s
VSRN [2]          16.7 s      4.7 s           74.3 s      21.6 s
EISIN             20.1 s      4.9 s           94.3 s      20.9 s
Table 4. Comparisons of details of recent methods.
Method          Flickr30K                              MS-COCO                                Params
                Epoch   Batch Size   LR                Epoch   Batch Size   LR
VSE++ [13]      30      128          0.0002/15/×0.1    30      128          0.0002/15/×0.1    67 M
SCAN [20]       30      128          0.0002/15/×0.1    20      128          0.0005/10/×0.1    9 M
VSRN [2]        30      128          0.0002/15/×0.1    30      128          0.0002/15/×0.1    140 M
CAMERA [9]      30      128          0.0001/10/×0.1    40      128          0.0001/20/×0.1    156 M
SEAM [38]       30      64           0.0001/10/×0.1    30      128          0.0002/10/×0.1    114 M
HREM [3]        30      128          0.0002/15/×0.1    30      128          0.0002/15/×0.1    131 M
EISIN           30      128          0.0002/15/×0.1    30      128          0.0002/15/×0.1    126 M
Table 5. Results of ablation studies conducted on the MS-COCO 5K test set.
Method              Text Retrieval            Image Retrieval           rSum
                    R@1     R@5     R@10      R@1     R@5     R@10
w/o AL              57.5    82.4    89.2      42.6    72.2    82.7      427.1
w/o SEL             58.7    83.9    90.7      41.7    72.0    82.3      429.4
w/o SEL and AL      56.8    82.3    89.6      42.5    72.9    82.5      426.7
EISIN               59.6    84.4    91.2      42.0    72.3    82.5      432.1
Bold indicates the best results in each column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
