1. Introduction
Cross-modal retrieval has shown great potential in handling large-scale multimedia retrieval tasks, primarily due to the use of Approximate Nearest Neighbor (ANN) search [1,2,3]. The primary challenge, however, lies in effectively bridging the semantic gap between modalities to accurately capture their semantic relationships. As the scale of the data increases, these methods often experience a serious decline in efficiency owing to the growing computational complexity. In this context, hashing techniques offer a promising and efficient solution for cross-modal ANN search, as they embed high-dimensional data points into a compact binary subspace, with similarity across modalities being measured via fast XOR operations. Recently, rapid advancements in deep learning have sparked increased interest in deep hashing approaches [4,5,6,7,8,9,10,11,12], which have been shown to outperform traditional shallow models in terms of semantic representation capabilities [13,14,15,16,17,18].
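As a small illustration of the XOR-based matching mentioned above (our own example, with hash codes packed into Python integers):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR marks the differing bits; the popcount of the result is the
    # Hamming distance between the two binary codes.
    return (a ^ b).bit_count()  # int.bit_count() requires Python >= 3.10

# Two 8-bit codes differing in exactly two positions:
assert hamming_distance(0b10110100, 0b10011100) == 2
```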
As prominent learning strategies for deep hashing, pairwise loss and triplet loss, both based on Data-to-Data Metric Learning, effectively preserve inter-class structural relationships by pulling similar instances closer together while pushing dissimilar instances further apart. However, these approaches predominantly focus on the relative distances between samples without imposing global constraints on the intra-class distribution, which leads to intra-class dispersion. This limitation is further exacerbated by the hashing quantization process, which significantly undermines the model's ability to capture fine-grained semantic representations. To address this issue, some recent strategies have emerged that are based on Data-to-Class Metric Learning, such as center loss [12,19] and proxy loss [4,20,21,22]. Center loss penalizes the distance between each sample and its corresponding class center, effectively clustering intra-class samples and promoting a more compact intra-class distribution. However, in multi-label scenarios, learning a stable center vector for each class remains challenging. Proxy loss-based methods mitigate this issue by introducing a proxy vector for each single-label category and learning the hash representation by minimizing the distance between samples and their corresponding proxies while maximizing the distance between samples and proxies from other categories. Notably, proxy-based hashing methods demonstrate remarkable generalization capabilities, particularly in handling unseen categories, as they shift the learning focus from modeling the relationships between samples to learning the relationships between samples and proxy vectors. Despite these advantages, however, as illustrated in Figure 1, such approaches suffer from a significant drawback: they struggle to preserve the inter-class similarity structure.
To simultaneously preserve intra-class and inter-class similarity structures, an intuitive approach is to combine proxy loss with pairwise loss. However, constructing an optimal solution that leverages the advantages of both objectives presents significant challenges. Firstly, a semantic bias issue inevitably arises when considering the similarity relationships not only among data points themselves but also between data points and class proxies. This issue becomes particularly pronounced when the number of data points is imbalanced across classes. Taking Figure 2 as an example, one class is related to several proxy points, including proxy P1, while another class is related to only a subset of them; the pentagrams (★) represent the proxies for each category. When both types of relationships are considered, all the data points are prone to semantic bias: they will be predominantly attracted to proxy P1 if a significantly larger proportion of them are related to P1 than to the other proxies. Unfortunately, ensuring class balance in real-world applications is challenging, which makes semantic bias a significant obstacle to the practical deployment of this approach.
Worse still, despite preserving inter-class similarity relationships effectively, directly combining proxy loss and pairwise loss causes the intra-class dispersion issue to reappear. Consider the relationships between data point A and the other data points, as well as those between data point A and its relevant category proxies: the connections of data point A with other data points far outnumber its connections with relevant category proxies. Data-to-Data Metric Learning therefore still plays the dominant role, inevitably leading to intra-class dispersion. This limitation, in turn, undermines the preservation of the intra-class similarity structure.
To address the above issues, we propose a novel Deep Class-Guided Hashing (DCGH) method for multi-label cross-modal retrieval. We introduce a variance constraint to keep the distances between each data point and its relevant proxies as consistent as possible, preventing semantic bias. Regarding intra-class dispersion, we observe that in multi-label datasets the number of irrelevant pairs in the pairwise loss is much smaller than the number of relevant pairs, and that positive sample pairs impose a stronger constraint and are more likely to cause intra-class dispersion than negative sample pairs. We therefore assign different small weights to the positive and negative sample constraints of the pairwise loss, so that the aggregation of similar data points is primarily guided by the proxy loss, thereby addressing the issue of intra-class dispersion.
To sum up, the main contributions of this article are threefold:
General advancements: Starting from the perspective of intra-class aggregation and inter-class structural relationship maintenance, this paper proposes a combination of proxy loss and pairwise loss;
Innovative methodologies: In further considering the issues arising from the combination of proxy loss and pairwise loss, the DCGH method is proposed;
Comprehensive experiments: Extensive experiments on three benchmark datasets demonstrate that the proposed DCGH algorithm outperforms other baseline methods in cross-modal hashing retrieval.
The remainder of this paper is organized as follows. Section 2 reviews representative supervised and unsupervised cross-modal hashing methods. Section 3 introduces the proposed method in detail. The experimental results and analysis are provided in Section 4, and Section 5 summarizes the conclusions and future work.
3. Methodology
This section introduces the proposed method. We begin with a formal problem formulation. Subsequently, an overview of the DCGH framework is provided, followed by a detailed discussion of its individual components.
3.1. Problem Formulation
Notations. Without loss of generality, sets are denoted by math script uppercase letters (e.g., $\mathcal{D}$). Scalars or constants are denoted by uppercase letters (e.g., $C$). Matrices are represented by uppercase bold letters (e.g., $\mathbf{S}$), and the $(i,j)$-th element of $\mathbf{S}$ is denoted by $S_{ij}$. Vectors are denoted by lowercase bold letters (e.g., $\mathbf{b}$), and the $i$-th element of $\mathbf{b}$ is represented by $b_i$. The transpose of a matrix or a vector is denoted by a superscript $\mathrm{T}$ (e.g., $\mathbf{S}^{\mathrm{T}}$). Functions are denoted by calligraphic uppercase letters (e.g., $\mathcal{F}(\cdot)$). The frequently used mathematical notations are summarized in Table 1 for readability.
Problem Definition. This work focuses on two common modalities: image (denoted by $x$) and text (denoted by $y$). Given a dataset $\mathcal{D}=\{o_i\}_{i=1}^{N}$, where $o_i=(x_i,y_i)$ represents the $i$-th sample, $x_i$ and $y_i$ represent the $i$-th image and text, respectively, and $l_i\in\{0,1\}^{C}$ is the multi-label vector for $o_i$, where $C$ is the number of categories. The goal of cross-modal hashing is to learn two hashing functions, $\mathcal{F}^{x}(\cdot)$ and $\mathcal{F}^{y}(\cdot)$, to generate binary codes $b^{x}\in\{-1,1\}^{K}$ and $b^{y}\in\{-1,1\}^{K}$, ensuring that the Hamming distance between similar samples is smaller and that between dissimilar ones is larger. Since binary optimization is a typical NP-hard problem, a continuous relaxation strategy is employed to obtain binary-like codes.
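To make the relaxation concrete, a minimal PyTorch sketch (our own illustration, not the authors' code) is given below: training operates on continuous binary-like codes, and the sign function is applied only when the final discrete codes are needed.

```python
import torch

def binary_like(hash_logits: torch.Tensor) -> torch.Tensor:
    # Continuous relaxation: tanh squashes the hash logits into (-1, 1),
    # keeping the objective differentiable during training.
    return torch.tanh(hash_logits)

def binarize(hash_logits: torch.Tensor) -> torch.Tensor:
    # Discrete codes in {-1, +1}, produced only after training.
    return torch.sign(binary_like(hash_logits))
```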
3.2. Overview of DCGH Framework
As illustrated in
Figure 3, the proposed DCGH framework, in general, consists of two modules: the feature learning module and the hash learning module.
Feature learning module. Inspired by [4,27], we introduce feature extraction based on Transformer encoders into cross-modal retrieval to obtain representative semantic features from both the image and text modalities. Specifically, the image Transformer encoder has the same structure as the ViT encoder [40], which is composed of 12 stacked encoder blocks. Each encoder block has the same structure, including Layer Normalization (LN), Multi-Head Self-Attention (MSA), and MLP blocks, and each MSA has 12 heads. The image semantic feature is obtained as $f^{x}=\mathcal{E}^{x}(x;\theta_{x})$, where $\mathcal{E}^{x}$ represents the image semantic encoder and $\theta_{x}$ represents its parameters. The text Transformer encoder consists of 12 encoder blocks, each with an 8-head MSA. Similarly, the text semantic feature is obtained as $f^{y}=\mathcal{E}^{y}(y;\theta_{y})$, where $\mathcal{E}^{y}$ represents the text semantic encoder and $\theta_{y}$ is its parameter set.
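For concreteness, a minimal PyTorch sketch of this module is given below. It assumes the open-source OpenAI CLIP package; the hash-head design (a linear layer followed by tanh) and the attribute names are our own illustration rather than the exact DCGH architecture.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

class FeatureEncoders(torch.nn.Module):
    """Transformer encoders for both modalities, initialized from CLIP ViT-B/32."""
    def __init__(self, hash_bits: int = 32):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32", device="cpu")
        dim = self.clip_model.visual.output_dim  # 512 for ViT-B/32
        # Hash heads project semantic features to K-bit binary-like codes.
        self.img_hash = torch.nn.Linear(dim, hash_bits)
        self.txt_hash = torch.nn.Linear(dim, hash_bits)

    def forward(self, images, token_ids):
        img_feat = self.clip_model.encode_image(images).float()
        txt_feat = self.clip_model.encode_text(token_ids).float()
        # tanh yields binary-like codes in (-1, 1) (continuous relaxation).
        return torch.tanh(self.img_hash(img_feat)), torch.tanh(self.txt_hash(txt_feat))
```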
3.3. Hash Learning
In order to generate hash codes that maintain intra-class aggregation while preserving inter-class structural relationships, a comprehensive objective function composed of the proxy loss, the pairwise loss, and the variance constraint is designed to optimize the parameters of our proposed DCGH framework. In the following subsections, we detail each component.
3.3.1. Proxy Loss
Proxy-based methods can achieve satisfactory performance in single-label cross-modal retrieval, but in multi-label cross-modal retrieval they have been shown to perform poorly with limited hash bits, because they fail to deeply express multi-label correlations and neglect the preservation of inter-class structural relationships [20]. However, since only the relationships between data and proxy points need to be considered, the learned hash codes maintain intra-class aggregation well. Therefore, we first learn hash codes for intra-class aggregation through proxy loss.
For proxy loss [4], a learnable proxy is first generated for each label category. The hash representation is learned by bringing samples closer to their relevant proxies and pushing them away from irrelevant proxies. Let $P=\{p_1,p_2,\ldots,p_C\}$ denote the learnable proxies, one per label category, where each $p_c$ is a $K$-bit vector. When a sample and a proxy are relevant, we reduce the cosine distance $d(u,p)=1-\frac{u^{\mathrm{T}}p}{\|u\|_2\|p\|_2}$ between the binary-like hash code $u$ and the relevant proxy $p$; for proxies that are not related to the sample, the binary-like hash code is pushed away from the irrelevant proxies by penalizing their cosine similarity.
Hence, the proxy loss of the image modality, $\mathcal{L}_{p}^{x}$, can be calculated as shown in Equation (2), where $I$ is an indicator function and the denominators represent the numbers of relevant and irrelevant data-proxy pairs, respectively, for normalization [4]. Similarly, the proxy loss of the text modality, $\mathcal{L}_{p}^{y}$, can be calculated as shown in Equation (3). The total multi-modal proxy loss, $\mathcal{L}_{p}$, which aggregates the two modality-specific terms, is calculated as shown in Equation (4).
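Because the displayed equations are not reproduced in this version, the following PyTorch sketch gives one plausible realization of Equations (2)-(4) consistent with the description above; the hinge on the cosine similarity for irrelevant pairs and the plain sum over the two modalities are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def proxy_loss(codes: torch.Tensor, proxies: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # codes: (n, K) binary-like codes; proxies: (C, K) learnable; labels: (n, C) multi-hot.
    cos = F.normalize(codes, dim=1) @ F.normalize(proxies, dim=1).t()  # cosine similarity (n, C)
    dist = 1.0 - cos                                                   # cosine distance d(u, p)
    pos, neg = labels.float(), 1.0 - labels.float()
    # Denominators normalize by the counts of relevant / irrelevant data-proxy pairs.
    pull = (pos * dist).sum() / pos.sum().clamp(min=1)        # pull toward relevant proxies
    push = (neg * F.relu(cos)).sum() / neg.sum().clamp(min=1) # penalize residual similarity
    return pull + push

# Equation (4), read as a plain sum over the two modalities:
# L_p = proxy_loss(u_x, proxies, labels) + proxy_loss(u_y, proxies, labels)
```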
3.3.2. Pairwise Loss
In order to maintain inter-class structural relationships, we further explore the relationships between data points; that is, we bring similar data closer together and push dissimilar data apart. To this end, we define a similarity matrix $S$ as follows:

$$S_{ij}=\frac{l_{i}\,l_{j}^{\mathrm{T}}}{\|l_{i}\|_{2}\,\|l_{j}\|_{2}}$$

where $\|\cdot\|_{2}$ is the $\ell_{2}$ norm, and $l_{j}^{\mathrm{T}}$ is the transpose of the vector (or matrix). The range of $S_{ij}$ is [0, 1]. If $S_{ij}>0$, then $x_{i}$ (or $y_{i}$) and $x_{j}$ (or $y_{j}$) are called a relevant pair; if $S_{ij}=0$, they are considered an irrelevant pair. To pull relevant pairs closer, we reduce the cosine distance between relevant data pairs; inspired by [29], we relax the constraints on irrelevant data pairs, so the distance between irrelevant data pairs is pushed apart with a weaker penalty.
Hence, the relevant loss, $\mathcal{L}_{rel}$, can be calculated using Equation (7). Similarly, the irrelevant loss, $\mathcal{L}_{irr}$, is calculated using Equation (8).
To address the intra-class dispersion arising from the pairwise loss, we assign it a small weight, allowing the aggregation of similar data points to be primarily guided by the proxy loss. Considering that in multi-label datasets the number of irrelevant pairs in the pairwise loss is significantly smaller than the number of relevant pairs, and that positive sample pairs impose a stronger constraint and are more likely to lead to intra-class dispersion than negative sample pairs, we assign different small weights to the positive and negative sample constraints. The overall pairwise loss, $\mathcal{L}_{pa}$, is therefore given by the following formula:

$$\mathcal{L}_{pa}=\alpha\,\mathcal{L}_{rel}+\beta\,\mathcal{L}_{irr} \quad (9)$$

where $\alpha$ and $\beta$ are the small weight hyperparameters for the positive and negative sample constraints, respectively.
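A PyTorch sketch consistent with this subsection is given below; the exact form of the relaxed irrelevant-pair term in Equations (7) and (8) is our assumption.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(labels: torch.Tensor) -> torch.Tensor:
    # S_ij = l_i l_j^T / (||l_i||_2 ||l_j||_2); lies in [0, 1] for multi-hot
    # labels, and S_ij > 0 marks a relevant pair (Section 3.3.2).
    l = F.normalize(labels.float(), dim=1)
    return l @ l.t()

def pairwise_loss(codes_x, codes_y, S, alpha=0.05, beta=0.8):
    # Equation (9): L_pa = alpha * L_rel + beta * L_irr. A small alpha keeps the
    # proxy loss dominant; the hinge on irrelevant pairs is our reading of the
    # "relaxed" constraint inspired by [29].
    cos = F.normalize(codes_x, dim=1) @ F.normalize(codes_y, dim=1).t()
    rel, irr = (S > 0).float(), (S == 0).float()
    l_rel = (rel * (1.0 - cos)).sum() / rel.sum().clamp(min=1)  # pull relevant pairs
    l_irr = (irr * F.relu(cos)).sum() / irr.sum().clamp(min=1)  # push residual similarity
    return alpha * l_rel + beta * l_irr
```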
3.3.3. Variance Constraint
As shown in Figure 2, when considering both the relationships between points and proxies and those between points themselves, if more data points are related to proxy P1 than to the other proxies, the data points will tend to lean towards proxy P1. However, a data point should maintain a consistent relationship with each of its relevant proxies. Therefore, we use a variance constraint to maintain consistent distances between the data and their relevant proxy points. For an image data point $x_{i}$, $l_{i}$ represents its label, and we use $\mathcal{R}_{i}=\{c\mid l_{ic}=1\}$ to denote the index set of its corresponding relevant proxies. The variance constraint for it can then be given by the following formula:

$$\mathcal{V}(x_{i})=\frac{1}{|\mathcal{R}_{i}|}\sum_{c\in\mathcal{R}_{i}}\Big(d(u_{i}^{x},p_{c})-\bar{d}_{i}\Big)^{2},\qquad \bar{d}_{i}=\frac{1}{|\mathcal{R}_{i}|}\sum_{k\in\mathcal{R}_{i}}d(u_{i}^{x},p_{k}) \quad (10)$$

where $|\cdot|$ represents the number of elements in the set. Hence, the overall constraints of the image, $\mathcal{L}_{v}^{x}$, and the text, $\mathcal{L}_{v}^{y}$, can be calculated using Equation (11). The total variance constraint loss, $\mathcal{L}_{v}$, which aggregates both modalities, is calculated as shown in Equation (12).
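The following sketch implements the variance constraint as the per-sample variance of distances to the relevant proxies, which is our reading of Equation (10):

```python
import torch
import torch.nn.functional as F

def variance_constraint(codes: torch.Tensor, proxies: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Encourages each data point to stay equidistant from all of its relevant
    # proxies, preventing it from drifting toward an over-represented one.
    cos = F.normalize(codes, dim=1) @ F.normalize(proxies, dim=1).t()
    dist = 1.0 - cos                                  # (n, C) cosine distances
    rel = labels.float()                              # (n, C) relevance mask
    cnt = rel.sum(dim=1).clamp(min=1)                 # |R_i| per sample
    mean = (rel * dist).sum(dim=1) / cnt              # mean distance to relevant proxies
    var = (rel * (dist - mean.unsqueeze(1)) ** 2).sum(dim=1) / cnt
    return var.mean()
```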
3.4. Optimization
The training algorithm is a critical component of our proposed DCGH framework and is presented in Algorithm 1. The DCGH model is optimized by standard backpropagation and mini-batch gradient descent. The algorithm for generating the hash codes is shown in Algorithm 2: for a query sample, the well-trained DCGH model generates binary-like hash codes, and the sign function then produces the final binary hash codes. Specifically, given query data $q$ (an image or a text), the compact hash code can be generated by Equation (14):

$$b^{q}=\mathrm{sign}\big(\mathcal{F}^{*}(q;\theta_{*})\big),\quad *\in\{x,y\} \quad (14)$$
Algorithm 1 Learning algorithm for DCGH
Input: Training dataset $\mathcal{D}$; binary code length $K$; hyperparameters $\alpha$, $\beta$.
Output: Network parameters $\theta_{x}$, $\theta_{y}$.
1: Initialize the network parameters $\theta_{x}$ and $\theta_{y}$, the maximum iteration number $T$, the mini-batch size 128, and the proxies $P$.
2: Construct a similarity matrix $S$ from the multi-label set $L$;
3: while the maximum iteration number $T$ is not reached do
4: Capture the feature vectors $u^{x}$ and $u^{y}$ by forward propagation.
5: Compute the proxy loss $\mathcal{L}_{p}$ by Equation (4).
6: Compute the pairwise loss $\mathcal{L}_{pa}$ by Equation (9).
7: Compute the variance constraint $\mathcal{L}_{v}$ by Equation (12).
8: Update the proxies $P$ by backpropagation.
9: Update the network parameters $\theta_{x}$ and $\theta_{y}$ by backpropagation.
10: end while
11: return The trained DCGH model.

Algorithm 2 Learning hash codes for DCGH
Input: Query samples $q$; parameters of the trained DCGH model.
Output: Binary hash code $b^{q}$ for $q$.
1: Calculate binary-like hash codes by feeding the query data into the trained DCGH model.
2: Calculate the hash codes $b^{q}$ by using the sign function (Equation (14)).
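Putting the pieces together, a minimal training step and encoding routine in the spirit of Algorithms 1 and 2 might look as follows; this reuses the sketches above, and the plain sum of the three loss terms is our reading of the overall objective described in Section 3.3.

```python
import torch

num_classes, num_bits = 24, 32                     # e.g., MIRFLICKR-25K with 32-bit codes
model = FeatureEncoders(hash_bits=num_bits)        # illustrative module from Section 3.2
proxies = torch.nn.Parameter(torch.randn(num_classes, num_bits))
optimizer = torch.optim.Adam(list(model.parameters()) + [proxies], lr=1e-3)

def train_step(images, token_ids, labels):
    # One mini-batch update following Algorithm 1.
    u_x, u_y = model(images, token_ids)
    S = similarity_matrix(labels)
    loss = (proxy_loss(u_x, proxies, labels) + proxy_loss(u_y, proxies, labels)
            + pairwise_loss(u_x, u_y, S, alpha=0.05, beta=0.8)
            + variance_constraint(u_x, proxies, labels)
            + variance_constraint(u_y, proxies, labels))
    optimizer.zero_grad()
    loss.backward()        # updates both the network parameters and the proxies
    optimizer.step()
    return loss.item()

@torch.no_grad()
def encode(images, token_ids):
    # Algorithm 2: binary-like codes -> sign -> final hash codes (Equation (14)).
    u_x, u_y = model(images, token_ids)
    return torch.sign(u_x), torch.sign(u_y)
```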
4. Experiments
In this section, we present and analyze the experimental results of the proposed method alongside those of several state-of-the-art competitors. Firstly, the details of the experimental setting are introduced in Section 4.1. We then discuss the performance comparison, ablation studies, sensitivity to parameters, training and encoding time, and visualization in Section 4.2, Section 4.3, Section 4.4, Section 4.5 and Section 4.6, respectively.
4.1. Experimental Setting
Datasets. Our experiments were conducted on three commonly used benchmark datasets, i.e., MIRFLICKR-25K [41], NUS-WIDE [42], and MS COCO [43]. A brief introduction to each is given here:
MIRFLICKR-25K: This is a small-scale cross-modal multi-label dataset collected from the Flickr website. It includes 24,581 image-text pairs corresponding to 24 classes, in which each sample pair belongs to at least one category.
NUS-WIDE: This dataset contains 269,648 image-text pairs, and each of them belongs to at least 1 of the 81 categories. To enhance the dataset’s practicality and compatibility with other research methods, we conducted a selection process, removing categories with fewer samples and choosing 21 common categories. This subset contains 195,834 image-text pairs, with each pair belonging to at least one of the categories.
MS COCO: This dataset is a highly popular large-scale dataset in computer vision research. It comprises 82,785 training images and 40,504 validation images, each accompanied by corresponding textual descriptions and labels, and covers 80 different categories. In our study, the training and validation sets were combined; each sample contains both image and text modalities and belongs to at least one category.
The statistics of these three datasets are reported in
Table 2.
Implementation details. In our experiments, we implemented a unified sampling strategy across the three datasets. Initially, we randomly selected 5000 image-text pairs from each dataset as the query set, with the remainder serving as the database set. During the model training phase, we randomly chose 10,000 image-text pairs as the training set. To ensure consistency and fairness in the experiments, we performed the same preprocessing operations on the images and text for all datasets: the image sizes were adjusted to 224 × 224, and the text was represented through BPE [44] encoding.
To ease reading, we report the detailed configuration of each of the components in the proposed DCGH framework in
Table 3.
Experimental environment. We implemented our DCGH via PyTorch on an NVIDIA RTX 3090 GPU. For the network configuration, the two Transformer encoders of DCGH, ViT [40] and GPT-2 [45], were initialized using pre-trained CLIP (ViT-B/32) [46]. Our model employs the adaptive moment estimation (Adam) optimizer to update the network parameters until convergence [47]. The learning rate of the two backbone Transformer encoders, ViT and GPT-2, was set to 0.00001, while the learning rate of the hash learning module in DCGH was set to 0.001. The two hyperparameters, $\alpha$ and $\beta$, were set to 0.05 and 0.8, respectively. The batch size was 128.
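This two-learning-rate setup can be expressed with Adam parameter groups, as in the following sketch (reusing the illustrative FeatureEncoders module from Section 3.2; the attribute names and the assignment of the proxies to the larger-rate group are our assumptions):

```python
import torch

model = FeatureEncoders(hash_bits=32)              # illustrative module from Section 3.2
proxies = torch.nn.Parameter(torch.randn(24, 32))  # e.g., C = 24 proxies for MIRFLICKR-25K
optimizer = torch.optim.Adam([
    # Pre-trained CLIP backbones (ViT and GPT-2) use the smaller learning rate.
    {"params": model.clip_model.parameters(), "lr": 1e-5},
    # The hash-learning heads (and, we assume, the proxies) use the larger rate.
    {"params": list(model.img_hash.parameters()) + list(model.txt_hash.parameters()), "lr": 1e-3},
    {"params": [proxies], "lr": 1e-3},
])
```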
Baseline methods. In our experiments, we selected 14 state-of-the-art deep cross-modal hashing methods for comparison: Deep Cross-Modal Hashing (DCMH) [23], Self-Supervised Adversarial Hashing (SSAH) [25], Cross-Modal Hamming Hashing (CMHH) [48], Adversary Guided Asymmetric Hashing (AGAH) [49], Deep Adversarial Discrete Hashing (DADH) [31], Self-Constraining Attention Hashing Network (SCAHN) [9], Multi-label Enhancement Self-supervised Deep Cross-modal Hashing (MESDCH) [26], Differentiable Cross-modal Hashing via Multi-modal Transformers (DCHMT) [27], Modality-Invariant Asymmetric Networks (MIAN) [50], Data-Aware Proxy Hashing (DAPH) [21], Deep Semantic-aware Proxy Hashing (DSPH) [4], Deep Neighborhood-aware Proxy Hashing (DNPH) [22], Deep Hierarchy-aware Proxy Hashing (DHaPH) [33], and Semantic Channel Hashing (SCH) [29]. Since some methods are not open source, we directly cite the results from their published papers.
Evaluation protocols. In our experiments, we employed five commonly used evaluation metrics to assess the performance of cross-modal similarity searches, which include mean Average Precision (mAP), Normalized Discounted Cumulative Gain using the top 1000 returned samples (NDCG@1000), Precision with a Hamming radius of 2 (P@H ≤ 2) curve, Precision-Recall (PR) curve, and Precision@Top N curve. The mAP is calculated as the average of the AP across all query samples. The PR curve represents the variation curve between recall and precision. Top-N Precision curve illustrates the proportion of truly relevant data among the top N results returned by the system. NDCG@1000 is a comprehensive metric that assesses the ranking performance of the top 1000 retrieval results. Precision with a Hamming radius of 2 describes the precision of the samples retrieved within the specified Hamming radius. The experimental results from the aforementioned evaluation metrics demonstrate that the DCGH method achieves excellent performance in cross-modal similarity searches.
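For reference, a minimal implementation of mAP over Hamming ranking is sketched below (our own; baseline papers may differ in details such as top-k truncation):

```python
import torch

def mean_average_precision(query_codes, db_codes, query_labels, db_labels, topk=None):
    # codes: {-1, +1} float tensors; labels: multi-hot float tensors.
    K = query_codes.size(1)
    aps = []
    for q, lq in zip(query_codes, query_labels):
        dist = 0.5 * (K - db_codes @ q)              # Hamming distance via inner product
        order = dist.argsort()
        rel = ((db_labels[order] @ lq) > 0).float()  # a shared label marks a relevant item
        if topk is not None:
            rel = rel[:topk]
        if rel.sum() == 0:
            continue                                 # skip queries with no relevant items
        ranks = torch.arange(1, rel.numel() + 1, dtype=torch.float)
        prec = torch.cumsum(rel, dim=0) / ranks      # precision at each rank
        aps.append((prec * rel).sum() / rel.sum())   # average precision for this query
    return torch.stack(aps).mean().item()
```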
4.2. Performance Comparison
We validated the performance of DCGH by comparing it with state-of-the-art deep cross-modal hashing methods on image-text retrieval tasks across three public datasets. The mAP results are shown in Table 4, where "Img2Txt" indicates image-to-text retrieval and "Txt2Img" indicates text-to-image retrieval. In most cases, DCGH outperforms the other baseline methods, achieving satisfactory performance. On the MIRFLICKR-25K dataset, SCH performs best among the baselines: it is, on average, 0.37% higher than our method in the image-to-text retrieval task and 0.9% higher in the text-to-image retrieval task. However, on the NUS-WIDE and MS COCO datasets, as the amount of data increases, the issue of intra-class dispersion leads to unsatisfactory performance for SCH. In contrast, our DCGH method considers both intra-class aggregation and inter-class structural relationships and essentially achieves the best performance among all compared methods, which confirms its effectiveness. We further evaluated performance in the multi-label retrieval scenario using the NDCG@1000 metric.
The results of NDCG@1000 can be found in Table 5, from which it can be observed that our method and DNPH have comparable scores, each with its own advantages and disadvantages. DNPH, which introduces a uniform distribution constraint on top of proxy loss, achieves the optimal NDCG@1000 scores for the image-text retrieval tasks at 16 bits and 32 bits on the three public datasets in most cases. However, at 64 bits, as the hash code length increases and the discrete space becomes sparser, DCGH obtains higher-quality ranking results than DNPH.
To further evaluate the performance of our method, Figure 4, Figure 5 and Figure 6 show the PR curves on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets at 16 bits and 32 bits, and Figure 7 shows the Top-N precision curves on the three datasets at 32 bits. The P@H ≤ 2 results for different hash code lengths on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets are shown in Figure 8, where Figure 8a-c display the results for the image-to-text retrieval task and Figure 8d-f show the results for the text-to-image retrieval task. Compared with the state-of-the-art baseline methods, our method achieves comparable or even better results.
4.3. Ablation Studies
To validate the effectiveness of the DCGH method, we implemented three variants and calculated their mAP values for the image-to-text and text-to-image retrieval tasks. In detail, (1) DCGH-P-V uses only the proxy loss, $\mathcal{L}_{p}$, to train the model. (2) DCGH-X-V uses only the pairwise loss, $\mathcal{L}_{pa}$, to train the model; since the proxy loss is not considered here, $\alpha$ and $\beta$ were set to 1. (3) DCGH-V does not use the variance constraint, $\mathcal{L}_{v}$.
The ablation experiment results are shown in Table 6. Comparing the results of DCGH-P-V and DCGH-X-V on the three benchmark datasets reveals that DCGH-P-V, which only uses proxy loss, generally performs better on large datasets like NUS-WIDE and MS COCO than DCGH-X-V, which only uses pairwise loss: as the amount of data increases, the intra-class dispersion caused by pairwise loss significantly degrades the effectiveness of the hash codes. Comparing DCGH-V with DCGH-P-V and DCGH-X-V across the three datasets shows that DCGH-V, which combines proxy loss and pairwise loss to consider both intra-class aggregation and inter-class structural relationship preservation, significantly outperforms the variants that use only one type of loss. This confirms the rationality of our combination approach. Finally, comparing DCGH with DCGH-V on the three benchmark datasets shows that introducing the variance constraint to prevent semantic bias leads to better results, confirming its effectiveness. Together, the three variants verify the contribution of each component: by combining proxy loss, pairwise loss, and the variance constraint, DCGH is able to learn excellent hash codes.
4.4. Sensitivity to Hyperparameters
We also investigated the sensitivity of the hyperparameters $\alpha$ and $\beta$. We set their ranges to {0.001, 0.01, 0.05, 0.1, 0.5, 0.8, 1}, and the results are reported in Figure 9. From Figure 9a,d, it can be observed that the trend of $\alpha$ and $\beta$ on the MIRFLICKR-25K dataset is quite intuitive, with the best performance occurring at $\alpha = 0.05$ and $\beta = 0.8$. Figure 9b,c show that on large-scale datasets, as $\alpha$ increases, meaning the constraint between positive samples becomes stronger, the proxy loss can no longer dominate the training of the hashing network, leading to intra-class dispersion and a significant drop in mAP scores. Figure 9f indicates that, owing to the more relaxed constraint on negative samples, an increase in $\beta$ has a smaller impact on the hashing network. From Figure 9e, we also find an interesting phenomenon: the value of $\beta$ has almost no effect on the mAP scores on the NUS-WIDE dataset, which we believe may be because the NUS-WIDE dataset is overwhelmingly large while having only 21 categories, so the proportion of unrelated samples is too small. This also demonstrates the benefit of setting separate weight parameters for the positive and negative sample constraints in the pairwise loss.
4.5. Training and Encoding Time
To investigate the efficiency of the model, we compared the training time and encoding time of the DCGH algorithm with other Transformer-based cross-modal hashing algorithms on the MIRFLICKR-25K dataset using 32-bit binary codes, with the results shown in Figure 10. During the optimization of the DCGH model, the proposed proxy loss has a time complexity of $O(nC)$, where $n$ is the number of training samples and $C$ is the number of categories. Additionally, the time complexities of the pairwise loss and the variance constraint are $O(n^{2})$ and $O(\gamma nC)$, respectively. Here, $\gamma$ represents the ratio of the number of related data-proxy pairs to the total number of data-proxy pairs, with $\gamma$ being less than 1. Therefore, the total time complexity of our algorithm is $O\big(n^{2}+(1+\gamma)nC\big)$, which is approximately equal to $O(n^{2}+nC)$. As shown in
Figure 10a, compared with several other state-of-the-art methods, our training time is moderate; moreover, since the training process of these methods is offline, the training time does not affect retrieval performance. As shown in the comparison of encoding time in Figure 10b, the encoding time of all methods is within the millisecond range, indicating that our method achieves competitive encoding efficiency.
4.6. Visualization
To explore the ability of our model to bridge the semantic gap between different modalities, we used the t-SNE [51] technique to project the discrete hash codes into a two-dimensional space on the three public datasets. As shown in Figure 11, differently colored points represent data from different modalities; as we can see, the data from the text modality and the image modality are aligned very well.
To explore the quality of the hash codes and the degree of intra-class aggregation, we selected samples from seven different single-label categories in the NUS-WIDE dataset and performed t-SNE visualization on the 16-bit hash codes generated by the DCHMT, SCH, and DCGH methods. The results are shown in Figure 12, where differently colored dots represent data from different single-label categories. Through comparison, we can see that our method obtains higher-quality and more intra-class-aggregated binary hash codes, owing to the joint training of the model with the proxy loss, pairwise loss, and variance constraint terms.
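A minimal sketch of this visualization step, assuming scikit-learn and matplotlib (the function name is our own):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_modality_tsne(img_codes: np.ndarray, txt_codes: np.ndarray) -> None:
    # Project image and text hash codes into 2-D and color them by modality,
    # as in Figure 11.
    codes = np.vstack([img_codes, txt_codes])
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(codes)
    n = len(img_codes)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=4, label="image")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=4, label="text")
    plt.legend()
    plt.show()
```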