1. Introduction
With the prosperity of multimedia technology and smart devices, a tremendous amount of multi-modal data (e.g., text, image, video, and audio) has been pouring into the Internet [1,2,3,4,5]. Despite differences in structure, the various types of data are usually semantically related to each other, and these semantic relationships can be exploited for data retrieval or sharing. Naturally, cross-modal retrieval technology [6,7,8,9] has become a desideratum, since it efficiently returns results in one modality for queries in another by effectively mining the intrinsic semantic relationships.
The primary issue of cross-modal retrieval is reducing the heterogeneity gap between modalities. Most existing approaches address this issue by projecting the original data features into a common real-valued subspace in which semantic similarity can be easily measured [10,11,12,13,14,15,16,17,18,19]. Unfortunately, due to the explosive growth of data, the computational complexity of real-valued cross-modal retrieval has become an unavoidable challenge. A viable solution is cross-modal hashing [20,21,22,23,24,25,26,27,28,29,30], which maps high-dimensional multi-modal features into compact binary codes so that cross-modal similarity can be calculated efficiently by XOR operations.
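As a side illustration (not taken from the paper), the efficiency claim comes from the fact that the Hamming distance between two binary hash codes packed into integers reduces to a single XOR followed by a bit count:

```python
# Hypothetical illustration: Hamming distance between packed binary hash codes
# is one XOR plus a popcount, which is why binary codes make retrieval fast.
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two packed binary hash codes."""
    return bin(code_a ^ code_b).count("1")

# Two 8-bit codes differing in exactly two bit positions.
assert hamming_distance(0b10110010, 0b10010011) == 2
```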
Depending on whether category information is used during the training stage, existing cross-modal hashing methods are mainly classified into unsupervised and supervised approaches. Generally, without category information, it is difficult for unsupervised cross-modal hashing [31,32,33,34,35,36,37,38] to generate cross-modal hash codes with strong semantic discrimination, even though such methods endeavor to learn latent similarity structures among different modalities. Supervised cross-modal hashing [39,40,41,42,43,44,45], in contrast, is able to use category labels to enhance cross-modal semantic discrimination and thus obtain high-quality hash codes. As a special case of supervised hash learning, multi-label cross-modal hashing methods [29,46,47,48,49,50] aim to efficiently handle instances associated with multiple labels [51,52,53]. Unlike traditional single-label hashing methods that focus on binary similarity between individual instances, they use multiple labels to construct a semantic similarity matrix so as to learn more accurate similarity relationships (e.g., each similarity is defined as a real value between −1 and 1). Meanwhile, motivated by the remarkable achievements of contrastive learning, contrastive methods for cross-modal hashing [30,54] have been introduced, which aim to capture cross-modal similarities more effectively by comparing samples across modalities.
Motivation. Recently, great progress has been made in cross-modal hashing with contrastive learning, such as [30,54]. UCCH [54] represents the initial endeavor to employ contrastive learning within unsupervised cross-modal hashing; it introduces a cross-modal ranking learning (CRL) loss to alleviate the influence of false-negative pairs. Conversely, Unihash [30] leverages a contrastive label correlation learning (CLC) loss to establish connections between diverse modalities through category labels. However, both simply apply InfoNCE [55,56] to cross-modal hashing, which treats an image–text pair as a positive sample and everything else as a negative sample. This contrastive learning strategy gives rise to the false-negative problem, where samples belonging to the same class are incorrectly regarded as negatives, resulting in the learning of erroneous relationships among cross-modal instances. Hence, a widely adopted approach in supervised contrastive learning [57] is to consider instances similar if they share at least one common category. Taking Figure 1 as an intuitive example, the semantic similarities between all of the image–text pairs are considered to be 1 because each pair shares at least one category, i.e., "tree", with the anchor pair. However, since the three pairs share one, three, and four labels with the anchor, respectively, their entries in the semantic similarity matrix should be ordered accordingly rather than treated as equal. Indisputably, such a supervised contrastive learning strategy is not suitable for multi-label scenarios because this naive binary similarity cannot accurately reflect the complex semantic relationships between cross-modal instances. Thus,
the first challenge we face is how to combine multi-label information with supervised contrastive learning so as to account for the diverse relationships among cross-modal instances. Moreover, the same problem arises in the similarity matrix construction of most supervised cross-modal methods, i.e., instances are considered similar if they share at least one common category and dissimilar otherwise. Several pioneering studies try to use shared labels to describe semantic relationships more accurately. For example, ref. [29] uses bi-directional relation reasoning to calculate multi-label similarity in two directions so as to improve semantic similarity matrix construction, where the bi-directional relation comprises a consistent-direction relation and an inconsistent-direction relation. The consistent-direction relation is the similarity between two instances that share at least one common category, while the inconsistent-direction relation refers to the degree of dissimilarity between two instances that share no category. However, this method has a glaring flaw: due to the sparsity of multi-labels, as shown in Figure 2a, too many zeros are shared during semantic similarity matrix construction in the consistent direction, which prevents the method from accurately representing the semantic similarity between instances. Thus,
the second challenge we need to overcome is how to efficiently reduce the sparsity of multi-label representation during semantic similarity measurement.
Our Method. To overcome the above challenges, this paper proposes a novel Multi-Label Weighted Contrastive Cross-modal Hashing (MLWCH) method. As shown in Figure 3, on the one hand, a novel multi-label similarity measurement, termed compact consistent similarity representation, is proposed to improve the accuracy of semantic similarity calculation by producing more compact label vectors. As illustrated in Figure 2b, this technique reduces the dimensionality of the label vectors by eliminating redundant zero elements, thereby mitigating the impact of excessive zeros and yielding a more compact similarity representation. On the other hand, we extend supervised contrastive learning to multi-label scenarios via a newly designed multi-label weighted contrastive learning strategy. With the engagement of compact consistent similarity representation, this learning strategy assigns different weights to positive samples according to both linear and non-linear similarity relationships.
Contributions. The main contributions of this paper are fourfold:
We develop a novel multi-label cross-modal hashing framework called MLWCH to learn high-quality hash codes. To the best of our knowledge, MLWCH acts as a pioneer in attempting to enhance cross-modal hashing via multi-label contrastive learning supported by more precise semantic similarity representation.
We propose a novel multi-label similarity measurement, called compact consistent similarity representation, to construct a high-quality semantic similarity matrix. By reducing the sparsity of label vectors through eliminating redundant zero elements, it achieves a more compact similarity calculation and focuses on informative and crucial non-zero elements.
We design a novel multi-label weighted contrastive learning strategy by marrying supervised contrastive learning with compact consistent similarity representation, which assigns different weights to different positive samples by considering both linear and non-linear similarities.
We conducted extensive experiments, including performance comparison, ablation study, and hyperparameter sensitivity analysis, on three well-known benchmark datasets. The remarkable results demonstrate the superiority of our method.
Roadmap. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents the details of our multi-label weighted contrastive cross-modal hashing (MLWCH) framework and its optimization. Section 4 reports the evaluation of MLWCH and comparative experimental results on three datasets. Section 5 concludes the paper.
3. The Proposed Methodology
In this section, we present our approach, MLWCH, which integrates compact consistent similarity representation and multi-label weighted contrastive learning to generate high-quality cross-modal hash codes. Firstly, we lay out the notational groundwork and formally define the problem; we then proceed to our proposed method, including the framework, hash learning strategy, and optimization algorithm.
3.1. Notations and Problem Definition
Notations. Without loss of generality, sets are denoted by Euler script uppercase letters; matrices, vectors, and scalars are denoted by bold uppercase letters, bold lowercase letters, and regular uppercase/lowercase letters (e.g., N or n), respectively. Matrix elements and matrix transposition follow standard notation, and vector spaces are presented in blackboard bold uppercase. The ℓ1, ℓ2, and Frobenius norms are written with the usual norm symbols, and functions or models are denoted by calligraphic uppercase letters. To facilitate reading, the frequently used mathematical notations are summarized in Table 1.
Problem Definition. This study considers two commonly used modalities, i.e., image and text. Suppose that there is a multi-label cross-modal dataset containing N instances, where each instance consists of original image features, text features, and a corresponding label vector. C indicates the number of categories; if an instance is labeled with the j-th category, the j-th entry of its label vector is 1, and 0 otherwise. Furthermore, the label vectors can be used to construct an N × N semantic similarity matrix whose entries indicate the semantic similarity between pairs of instances.
The goal of our work is to learn two hash models (one per modality) on a training dataset to generate hash representations from original instances. The element-wise sign function is used to obtain uniform binary hash codes of length k from the continuous hash representations. To conduct cross-modal hashing retrieval, the Hamming distance is used to measure similarities between instances: the more semantically similar two instances are, the smaller the Hamming distance between their binary codes should be, and vice versa. In this work, the proposed learning framework includes two groups of hash models: one for hash representation learning and the other for hash function learning.
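The quantization and comparison steps described above can be sketched as follows; the function names and toy vectors are our own illustration, not the paper's notation.

```python
import numpy as np

# Minimal sketch: a continuous hash representation is quantized by the
# element-wise sign function into a {-1, +1} code, and two codes are then
# compared via their Hamming distance.
def quantize(h: np.ndarray) -> np.ndarray:
    """Element-wise sign: continuous representation -> {-1, +1} hash code."""
    return np.where(h >= 0, 1, -1)

def hamming(b1: np.ndarray, b2: np.ndarray) -> int:
    """Hamming distance between two {-1, +1} codes of length k."""
    return int(np.sum(b1 != b2))

h_img = np.array([0.7, -0.2, 0.1, -0.9])   # continuous image representation
h_txt = np.array([0.5, 0.3, -0.4, -0.6])   # continuous text representation
b_img, b_txt = quantize(h_img), quantize(h_txt)
print(hamming(b_img, b_txt))  # codes disagree in 2 of k = 4 positions
```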
3.2. Overview of MLWCH Framework
As shown in Figure 3, the framework of MLWCH mainly consists of four modules: (i) the compact consistent similarity representation module, (ii) the multi-label weighted contrastive learning module, (iii) the hash representation learning module, and (iv) the hash function learning module.
3.3. Compact Consistent Similarity Representation
The prevailing solutions for supervised cross-modal hashing typically define a binary semantic similarity relationship, i.e., the similarity value is either "0" or "1". This naive approach cannot accurately measure the complex semantic similarity between two instances. Although bi-directional relational reasoning [29] offers a more advanced similarity calculation, its similarity representation in the consistent direction shares too many zeros due to the sparsity of label vectors, which unfortunately hinders accurate similarity measurement between instances. To this end, we propose a novel multi-label semantic similarity measurement called compact consistent similarity representation. As a refined version of bi-directional relation reasoning, it modifies the similarity calculation in the consistent direction. Specifically, two cases are considered: (i) If two instances share at least one label, their semantic similarity is defined by Equation (1), where the · symbol denotes the vector inner product. In Equation (1), the numerator is the number of categories shared by the i-th and j-th instances, and the denominator is the number of all categories contained by the i-th or j-th instance. This measures the similarity between two instances more accurately than bi-directional relation reasoning, since labels that belong to neither instance are excluded, focusing the measurement on the semantic relationships induced by shared labels. (ii) If the two instances share no label, their similarity is defined by Equation (2), where ⊕ is the XOR operation and C is the number of categories; the resulting value represents the degree of dissimilarity between the two instances.
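Since the equations themselves are not reproduced in this excerpt, the following sketch implements the textual description of the two cases under our own assumptions about the exact formulas: the shared-label count divided by the size of the label union for case (i), and a negative XOR-based score scaled by C for case (ii).

```python
import numpy as np

# Hedged sketch of compact consistent similarity (assumed formulas, not the
# paper's exact equations): for overlapping label vectors, similarity =
# (shared categories) / (categories held by either instance); with no shared
# label, dissimilarity = -(number of differing categories via XOR) / C.
def compact_consistent_similarity(li: np.ndarray, lj: np.ndarray) -> float:
    C = li.shape[0]
    shared = int(li @ lj)                        # shared categories (inner product)
    if shared > 0:
        union = int(np.sum(np.maximum(li, lj)))  # categories in either instance
        return shared / union
    # no shared label: XOR counts the differing categories
    return -float(np.sum(np.logical_xor(li, lj))) / C

a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 0])
c = np.array([0, 0, 1, 0, 1])
print(compact_consistent_similarity(a, b))  # 2 shared / 3 in union
print(compact_consistent_similarity(a, c))  # no overlap: -5/5 = -1.0
```

Note how, unlike a binary similarity, the value reflects the degree of label overlap, and labels absent from both instances never inflate the score.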
3.4. Multi-Label Weighted Contrastive Learning
The core idea of contrastive learning is maximizing the mutual information (MI) between similar instances. MI measures the correlation between two variables and quantifies the amount of information they share. In practical applications, however, the joint and marginal distributions of the two variables are often unknown, which makes MI calculation troublesome, so contrastive learning typically employs approximate estimators. For example, InfoNCE [55,56] provides a low-variance estimate of MI for high-dimensional data. However, InfoNCE is not a supervised learning strategy, since it is essentially unable to incorporate category information. To effectively adapt this learning strategy to multi-label cross-modal hashing, we start with the design of a cross-modal supervised contrastive loss and then extend it to the multi-label setting, resulting in our multi-label weighted contrastive loss.
Cross-Modal Supervised Contrastive Loss. Inspired by [57], we extend InfoNCE to the cross-modal supervised learning scenario. Firstly, we select instance pairs that share at least one category as positive pairs, and then construct the intra-modality InfoNCE losses for both the image and text modalities, shown as follows:

where the normalization term is the total number of samples that share at least one common label with the anchor, and the indicator function equals 1 if its condition c is true and 0 otherwise. Combining the above two loss functions, the intra-modality InfoNCE loss is defined as:

In an analogous manner, the inter-modality InfoNCE losses for image and text are defined as:

Accordingly, the inter-modality InfoNCE loss for multi-label cross-modal hash learning is defined as:

Combining Equations (5) and (8), we obtain the cross-modal supervised contrastive loss as follows:

where a trade-off factor balances the two InfoNCE losses.
Multi-Label Weighted Contrastive Loss. Obviously, the above supervised contrastive loss regards all positive samples as equally important, which, as discussed in Section 1, is not suitable for multi-label scenarios. To break through this limitation, we argue that different positive samples should be assigned different weights according to their shared labels. To this end, a novel multi-label weighted contrastive loss is developed. In particular, two different weights, i.e., a linear weight and a non-linear weight, are defined to represent multi-label semantic similarities, shown as follows:

The linear weight directly captures the relationship between positive samples by counting shared labels, while the non-linear weight, as a complement, measures more complex similarity: beyond the number of shared labels, for example, cosine similarity also considers the angle and direction between feature representations. Therefore, to comprehensively consider the similarity relationship between instances, we obtain the overall weight by combining Equations (10) and (11) as follows:

where a trade-off factor balances the two weights. Then, we normalize the weight as follows:

By introducing the overall weight, the multi-label weighted contrastive loss functions are defined as follows. For intra-modality:

Compared with the traditional supervised contrastive loss, the multi-label weighted contrastive loss assigns a weight to each positive sample, making the significance of a positive sample proportional to the number of labels it shares with the anchor. Similar to Equations (5) and (8), the intra- and inter-modality weighted contrastive losses are presented as:

Finally, combining Equations (18) and (19), we obtain the multi-label weighted contrastive loss as follows:

where a trade-off factor balances the two weighted InfoNCE losses.
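The weighting scheme can be sketched as follows. This is a simplified stand-in for the paper's loss, with our own assumptions: the linear weight is the shared-label count, the non-linear weight is the cosine similarity of label vectors, and the blended, normalized weights scale the inter-modal InfoNCE log-probabilities.

```python
import numpy as np

# Hedged sketch of multi-label weighted contrastive loss (our simplification,
# not the paper's exact formulation). Positive pairs are weighted by a blend
# of a linear term (shared-label count) and a non-linear term (label cosine),
# then the normalized weights scale the InfoNCE log-softmax terms.
def weighted_contrastive_loss(z_img, z_txt, labels, beta=0.5, tau=0.2):
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_img @ z_txt.T / tau                      # inter-modal similarities
    log_prob = logits - np.log(np.exp(logits).sum(1, keepdims=True))

    shared = labels @ labels.T                          # linear: shared-label counts
    norms = np.linalg.norm(labels, axis=1, keepdims=True)
    cosine = shared / (norms @ norms.T + 1e-12)         # non-linear: label cosine
    w = beta * shared + (1 - beta) * cosine             # blended weight
    w = np.where(shared > 0, w, 0.0)                    # positives share >= 1 label
    w = w / (w.sum(1, keepdims=True) + 1e-12)           # normalize per anchor

    return float(-(w * log_prob).sum(1).mean())

rng = np.random.default_rng(0)
z_i, z_t = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
labels = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
print(weighted_contrastive_loss(z_i, z_t, labels))  # scalar loss value
```

In this sketch, a positive sharing four labels with the anchor contributes more to the loss than one sharing a single label, which is exactly the ordering effect the weighting is meant to produce.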
3.5. Hash Representation Learning
Beyond all doubt, preserving semantic consistency between original instances and their hash representations is a key factor in generating high-quality hash codes. In other words, the more semantically similar an instance pair is, the smaller the Hamming distance between their hash codes should be, and vice versa. To this end, we construct the hash representation learning objective function by integrating three losses: (i) an intra-modal semantic similarity loss, (ii) an inter-modal semantic similarity loss, and (iii) the multi-label weighted contrastive loss.
Similarity Matrices. To construct the intra- and inter-modal semantic similarity losses, the cross-modal semantic similarities must first be represented. Specifically, three similarity matrices are constructed from the inner products between normalized hash representations: an image–image matrix, a text–text matrix, and an image–text matrix, whose entries are the semantic similarities between the corresponding pairs of hash representations.
Intra-Modal Semantic Similarity Loss. According to the above similarity matrices, we define the intra-modal semantic similarity loss for both modalities to preserve intra-modal semantic consistency:
Inter-Modal Semantic Similarity Loss. To preserve inter-modal semantic consistency, we use the inter-modal semantic similarity loss to effectively capture the heterogeneous similarities across different modalities:
Combining Equations (21) and (22), we obtain the intra-modal semantic similarity loss as follows:

Combining Equations (23) and (24), we obtain the total semantic similarity loss as follows:

Finally, we construct the hash representation learning objective function, which consists of the above two losses, shown as follows:

where a trade-off factor balances the two losses.
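A minimal sketch of the similarity-preservation idea, assuming an MSE-style Frobenius loss between the predicted and supervised similarity matrices (the exact loss form is not reproduced in this excerpt; names are ours):

```python
import numpy as np

# Sketch: predicted similarities are inner products of L2-normalized hash
# representations, pulled toward the supervised similarity matrix S with a
# mean-squared (Frobenius-style) loss.
def similarity_loss(H: np.ndarray, S: np.ndarray) -> float:
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    S_hat = Hn @ Hn.T                      # predicted similarity matrix in [-1, 1]
    return float(np.mean((S_hat - S) ** 2))

H = np.array([[1.0, 1.0], [1.0, 0.9], [-1.0, -1.0]])
S = np.array([[ 1.0,  1.0, -1.0],
              [ 1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])
print(similarity_loss(H, S))  # small: H already matches S closely
```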
3.6. Hash Function Learning
We learn another two modality-specific hash models to generate a binary hash code from each instance in the following manner:

where the sign function maps each continuous output to the binary hash code of the corresponding instance. Then, we treat the learned hash representation as a supervision signal to guide hash function learning:
To reduce quantization error, the following quantization loss is involved:
Combining Equations (28) and (29), we obtain the total loss for hash function learning as follows:
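The two losses of this stage can be sketched as follows; the mean-squared forms are our assumption, since the exact equations are not reproduced in this excerpt.

```python
import numpy as np

# Sketch of the hash-function learning stage: the network output F is pulled
# toward the fixed hash representation H from stage one (supervision loss),
# and toward its own sign (quantization loss). Loss forms are assumed.
def hash_function_losses(F: np.ndarray, H: np.ndarray):
    B = np.sign(F)                               # binary codes from network output
    supervision = float(np.mean((F - H) ** 2))   # follow stage-one representation
    quantization = float(np.mean((F - B) ** 2))  # push outputs toward {-1, +1}
    return supervision, quantization

F = np.array([[0.9, -0.8], [0.7, 0.6]])  # toy network outputs
H = np.array([[1.0, -1.0], [1.0, 1.0]])  # toy stage-one hash representations
sup, quant = hash_function_losses(F, H)
print(sup, quant)
```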
3.7. Optimization
The learning process of the proposed MLWCH consists of two stages: (i) the hash representation learning stage and (ii) the hash function learning stage. We adopt the Adam adaptive algorithm [63] for optimization. In the hash representation learning stage, we minimize Equation (26) to optimize the hash representation learning models as follows:

In the hash function learning stage, we minimize Equation (30) to optimize the hash function learning models as follows:
Overall, the entire optimization procedure of MLWCH is presented in Algorithm 1.
Algorithm 1: Optimization procedure for MLWCH.
Input: Number of training image–text pairs N; number of epochs e; hash code length k; batch size b; learning rates of the two networks; hyperparameters;
Output: Optimized model parameters;
1: Construct a semantic similarity matrix from the multi-label set;
2: repeat
3:   Perform iterative training for e epochs;
4:   // Hash representation learning stage: optimizing the objective function in Equation (31)
5:   for each batch do
6:     Randomly select b training image–text pairs;
7:     Generate continuous hash representations through the hash representation learning models;
8:     Calculate the loss by Equation (26) and update the parameters through back propagation;
9:   end for
10:  // Hash function learning stage: optimizing the objective function in Equation (32)
11:  for each batch do
12:    Select b training image–text pairs and their hash representations;
13:    Map the original image and text features into hash codes through the hash function learning models;
14:    Calculate the loss by Equation (30) and update the parameters through back propagation;
15:  end for
16: until the number of epochs e is reached
4. Experiment
To evaluate the performance of the proposed method holistically, extensive experiments are carried out on three widely used cross-modal retrieval datasets: MIRFLICKR-25K [64], NUS-WIDE [65], and MS COCO [66]. This section first introduces the experimental settings, including datasets, evaluation metrics, baselines, and implementation details. Then, we delve into the performance comparison of MLWCH with the baselines, the ablation study, and the hyperparameter sensitivity analysis.
4.1. Datasets
MIRFLICKR-25K. The original MIRFLICKR-25K dataset is made up of 25,000 image–text pairs from the Flickr website. In our experiments, we remove the pairs that have fewer than 20 tags, finally obtaining 20,015 image–tag pairs. Then, we extract 4096-dimensional CNN (AlexNet [67]) features to represent each image and 1386-dimensional Bag-of-Words (BoW) [68] features to represent each text.
NUS-WIDE. The original NUS-WIDE dataset contains 269,468 image–text pairs. We first abandon the data without categories, then choose data classified by the 10 most frequent categories to construct a subset which has 186,577 image–text pairs. For our experiments, we encode each image into a 4096-dimensional feature by AlexNet and each textual tag into a 1000-dimensional BoW feature.
MS COCO. This dataset contains 123,287 image–text pairs in 80 independent categories in total. Similar to the above datasets, a 4096-dimensional feature vector is generated by AlexNet for each image, and a BoW model is adopted to represent its corresponding text with 2000 dimensions.
4.2. Evaluation Metrics
To objectively evaluate the performance of our proposed method and compare it with the baseline methods, two frequently used cross-modal hashing evaluation protocols, i.e., Hamming ranking and hash lookup [69], are utilized in our experiments. The former ranks samples in the retrieval set by their Hamming distance to the query in ascending order, while the latter retrieves samples within a certain Hamming radius of the query [43]. The mean average precision (MAP) is used to measure the accuracy of the Hamming ranking protocol, while precision–recall (PR) curves are commonly used to measure the accuracy of the hash lookup protocol. Given a query, the AP score of the top n results is calculated by

where the indicator function equals 1 if the i-th retrieved sample is similar to the query (i.e., shares at least one common category with it) and 0 otherwise, and N denotes the number of relevant samples among the returned top n samples. MAP is the average of the APs over all queries:

where K is the size of the query set.
Moreover, during evaluation, an image and a text will be treated as a similar pair if they share at least one common label.
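The AP and MAP computation described above can be sketched as follows, assuming the standard AP@n definition (the paper's exact formula is not reproduced in this excerpt):

```python
import numpy as np

# Sketch of AP@n: average the precision at each rank where a relevant item
# appears, normalized by the number of relevant items in the top-n list.
# MAP then averages AP over all queries. Standard definitions, assumed here.
def average_precision(relevant: np.ndarray, n: int) -> float:
    rel = relevant[:n]
    hits = np.cumsum(rel)
    if hits[-1] == 0:
        return 0.0
    precision_at_i = hits / np.arange(1, n + 1)
    return float(np.sum(precision_at_i * rel) / hits[-1])

# relevance of the top-5 ranked results for two queries (1 = shares a label)
ap1 = average_precision(np.array([1, 0, 1, 0, 0]), n=5)  # (1/1 + 2/3) / 2
ap2 = average_precision(np.array([0, 1, 1, 1, 0]), n=5)
print((ap1 + ap2) / 2)  # MAP over the two queries
```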
4.3. Baselines and Implementation Details
Baselines. We compare the proposed MLWCH method with nine classical or state-of-the-art cross-modal hashing methods, including three shallow model-based methods (CVH [40], SCM [41], and CCA-ITQ [70]) and six deep model-based methods (DCMH [43], SSAH [44], DCHUC [26], SCCGDH [27], MMACH [28], and Bi_NCMH [29]). A brief introduction of each is presented below:
CVH aims to learn a hash function to efficiently find similar data items in the hash space by mapping data from different views to hash codes.
SCM introduces a large-scale supervised multi-modal hashing approach that emphasizes the idea of semantic relevance maximization for efficient similarity search between different modal data.
CCA-ITQ introduces an iterative quantization method to gradually improve the quality of binary codes through multiple iterations.
DCMH is the first attempt to integrate deep learning and hash learning into a unified framework for end-to-end learning.
SSAH is a self-supervised adversarial hashing method which treats labels as a single modality to supervise the learning process of semantic features and integrates adversarial learning into cross-modal hashing in a self-supervised manner.
DCHUC utilizes an iterative learning optimization algorithm to jointly learn hash codes and hash functions, where the learned hash codes and functions can supervise each other during the optimization process.
SCCGDH is a class-specific center-guided deep hashing method which makes use of hash codes of labels generated from labeled networks for class-specific centers and efficiently guides hash learning for image and text modalities.
MMACH integrates a new multi-label modality augmented attention module with self-supervised learning to supervise the training of hash functions for image and text modalities based on augmented multi-labels.
Bi_NCMH is a bi-directional relational reasoning-based deep cross-modal hashing method that builds a multi-label semantic similarity matrix through consistent and inconsistent relationships between instances.
For a fair comparison, following [27], we utilize AlexNet pre-trained on ImageNet to extract image features and employ the Bag-of-Words (BoW) model to extract text features. For MMACH and Bi_NCMH, as their source code was not available, we carefully implemented these methods ourselves.
Implementation Details. As shown in Figure 3, each of our hash models is composed of a two-layer multi-layer perceptron.
Three hyperparameters are involved in multi-label weighted contrastive learning: the trade-off factor between the linear and non-linear weights, the trade-off factor between the intra- and inter-modality weighted contrastive losses, and the temperature coefficient of contrastive learning. For hash representation learning, an additional trade-off factor adjusts the relative importance of the two losses. In our experiments, the first hyperparameter is assigned values of 0.3, 0.6, and 0.2 for MIRFlickr-25K, NUS-WIDE, and MS COCO, respectively; the second is set to 0.1 for all datasets; the third is set to 0.4, 0.46, and 0.26, respectively; and the fourth is assigned values of 0.4, 0.2, and 0.1, respectively. The Adam optimization algorithm [63] is adopted for model training, with the learning rates of hash representation learning and hash function learning set to 0.001 and 0.0001 on MIRFLICKR-25K, 0.001 and 0.0001 on NUS-WIDE, and 0.0015 and 0.0005 on MS COCO, respectively. The batch size is set to 512. For all experiments, two cross-modal retrieval tasks are considered: I2T, which uses an image query to return texts, and T2I, which uses a text query to return images.
Experimental Environment. All experiments were implemented using Python 3.8 on the PyTorch 1.12.1 framework, running on a deep learning workstation with an Intel(R) Core i9-12900K 3.9 GHz CPU, 128 GB RAM, 1 TB SSD and 2 TB HDD storage, and two NVIDIA GeForce RTX 3090Ti GPUs under the Ubuntu 22.04.1 operating system.
4.4. Performance Comparisons and Discussion
We investigate the retrieval performance of the proposed method MLWCH by comparing it with several state-of-the-art baselines on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. In the following, we discuss the comparison via Hamming ranking and hash lookup.
Hamming Ranking. The MAP@50 scores of MLWCH and the baseline methods under hash code lengths of 16, 32, and 64 bits are listed in Table 2, Table 3 and Table 4. From the experimental results, we have the following findings:
The proposed method MLWCH shows remarkable performance on all benchmark datasets: it beats both the hand-crafted methods and the deep neural network-based methods in all cases. In particular, our method outperforms SCCGDH, the strongest competitor, by a significant margin. For the I2T task, the margins were 0.0339 (16 bits), 0.0392 (32 bits), and 0.0524 (64 bits) on MIRFLICKR-25K; 0.0335 (16 bits), 0.0174 (32 bits), and 0.0364 (64 bits) on NUS-WIDE; and 0.0755 (16 bits), 0.0499 (32 bits), and 0.0495 (64 bits) on MS COCO. For the T2I task, the margins were 0.0708 (16 bits), 0.0425 (32 bits), and 0.0386 (64 bits) on MIRFLICKR-25K; 0.0217 (16 bits), 0.0231 (32 bits), and 0.0268 (64 bits) on NUS-WIDE; and 0.0609 (16 bits), 0.0801 (32 bits), and 0.0609 (64 bits) on MS COCO. These outstanding results verify that integrating compact consistent similarity representation with multi-label weighted contrastive learning can effectively enhance cross-modal hashing retrieval performance.
It is clear that the deep hashing methods achieve superior retrieval performance compared to the traditional shallow hashing methods in most cases on the three datasets. The main reason may be that deep learning methods can extract more essential high-level features than traditional shallow methods, which effectively reduces the semantic gap between modalities.
It can also be found that, besides the proposed method, all baselines achieve relatively lower results on MS COCO than on the other two datasets. This observation is mainly due to the fact that MS COCO provides more label categories than the other two datasets, which poses a greater challenge to cross-modal hash learning. Nevertheless, MLWCH still achieves the best results, which corroborates that the proposed technique can efficiently capture cross-modal semantic consistency under complex semantic conditions.
Compared to the deep cross-modal hashing methods utilizing multi-labels, our proposed MLWCH still obtains the highest performance. In particular, it greatly outperforms Bi_NCMH, which uses bi-directional relation reasoning, in all cases. The superiority of MLWCH is partly due to the fact that the proposed technique realizes more accurate multi-label similarity consistency reasoning to calculate the semantic relevance of original instances. In addition, combining multi-label learning with supervised contrastive learning effectively narrows the heterogeneity gap between original instances, which delivers superior hash learning performance.
Hash Lookup. By varying the Hamming radius from 0 to k, we plot the PR curves for hash code lengths of 16, 32, and 64 bits on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets; these curves are depicted in Figure 4, Figure 5 and Figure 6. In all cases, the PR curves of the proposed method are evidently higher than those of all baselines, which verifies that MLWCH learns cross-modal semantic relationships more effectively than the prevailing solutions.
4.5. Ablation Study
As mentioned above, compact consistent similarity representation and multi-label weighted contrastive learning are two instrumental components of MLWCH. To fully evaluate MLWCH, we conducted a thorough ablation study to analyze the contribution of each of these two techniques.
4.5.1. Effectiveness of Compact Consistent Similarity Representation
We evaluate the effectiveness of the compact consistent similarity representation module. Before the experiment, we modify MLWCH in the following two ways. First, we replace the multi-label similarity measurement with the naive rule commonly used by prevailing solutions: if two cross-modal samples share at least one label, their similarity is 1; otherwise, it is 0. For ease of discussion, this variant is named SLWCH. Second, we replace the multi-label similarity matrix in MLWCH with the multi-label similarity matrix in Bi_NCMH [29] and keep the other parts unchanged. This variant is termed MLBRH. Afterward, we compare the hashing performance of MLWCH with that of SLWCH and MLBRH on MIRFLICKR-25K, NUS-WIDE, and MS COCO. The corresponding MAP@50 values for hash code lengths of 16, 32, and 64 bits on the three datasets are presented in Table 5, Table 6, and Table 7, respectively.
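The contrast between the naive rule and a multi-label similarity can be made concrete with a small sketch. The first function is the SLWCH-style rule described above; the second is a common soft alternative (cosine similarity of binary label vectors), used here only as a hypothetical stand-in for the paper's multi-label measure.

```python
import numpy as np

def naive_similarity(l1, l2):
    # SLWCH-style rule: similarity is 1 if any label is shared, else 0.
    return float(np.any(np.logical_and(l1, l2)))

def soft_multilabel_similarity(l1, l2):
    # A soft alternative (cosine similarity of binary label vectors);
    # a hypothetical stand-in, not the paper's exact formulation.
    denom = np.sqrt(l1.sum() * l2.sum())
    return float(l1 @ l2 / denom) if denom > 0 else 0.0

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 1])
# naive_similarity(a, b) -> 1.0, although only one of four labels overlaps;
# soft_multilabel_similarity(a, b) -> 1/sqrt(6), a graded degree of relevance.
```

The naive rule collapses all partially overlapping label sets to the same similarity of 1, whereas a graded measure preserves how strongly two instances are related.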
As can be seen from Table 5, Table 6 and Table 7, on the one hand, the MAP@50 values of MLBRH and MLWCH are higher than those of SLWCH in all cases, which indicates that, compared with single-label semantic similarity, multi-label semantic similarity captures the complex semantic relationships between instances more accurately. On the other hand, the performance of MLWCH is overall superior to MLBRH on the three datasets, which confirms that the proposed compact consistent similarity representation module efficaciously improves the accuracy of cross-modal hashing retrieval. This is mainly because the proposed technique is instrumental in modeling sparse multi-labels, capturing the semantic similarities between instances more accurately, a characteristic that Bi_NCMH [29] does not possess.
4.5.2. Effectiveness of Multi-Label Weighted Contrastive Learning
To verify the effectiveness of the proposed multi-label weighted contrastive learning, we construct a variant of MLWCH, named MLSCH, by replacing the multi-label weighted contrastive loss in Equation (20) with the loss in Equation (9). To compare MLSCH with MLWCH, we also report their MAP@50 scores under different hash code lengths. The results on MIRFLICKR-25K, NUS-WIDE, and MS COCO are given in Table 8, Table 9, and Table 10, respectively.
For both I2T and T2I tasks, as manifested in Table 8, Table 9 and Table 10, the retrieval accuracy of MLWCH is higher than that of MLSCH, which confirms that, guided by the proposed loss, the quality of cross-modal hash code learning can be improved noticeably. The main reason behind this phenomenon is that this novel contrastive learning strategy enables the model to perceive more precise similarity relationships by assigning different weights to different positive instances.
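The idea of weighting positives differently can be sketched as a weighted variant of the InfoNCE loss for a single anchor. The function name and the assumption that positive weights come from multi-label similarity are illustrative; this is not necessarily the paper's exact weighting scheme.

```python
import numpy as np

def weighted_infonce(sim, pos_weight, tau=0.4):
    """Weighted InfoNCE for a single anchor (illustrative sketch).

    sim:        similarities of the anchor to all candidates.
    pos_weight: per-candidate weight, > 0 for positives, 0 for negatives;
                assumed here to derive from multi-label similarity.
    """
    exp_logits = np.exp(sim / tau)
    denom = exp_logits.sum()
    pos = pos_weight > 0
    per_pos_loss = -np.log(exp_logits[pos] / denom)
    # Positives with larger weights contribute more to the loss.
    return float((pos_weight[pos] * per_pos_loss).sum() / pos_weight[pos].sum())
```

With a single unit-weight positive this reduces to the standard InfoNCE loss; graded weights let strongly related positives dominate the gradient while weakly related ones are softly down-weighted.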
4.6. Hyperparameter Sensitivity Analysis
This section sheds light on the sensitivity of the four hyperparameters introduced above on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. To explore their comprehensive impact on hashing learning, the average accuracy of the I2T and T2I tasks is used to visualize the trend of cross-modal hashing performance. All these analyses are conducted with a hash code length of 16 bits.
4.6.1. Trade-Off Parameter
As shown in Equation (12), this hyperparameter is a trade-off factor between linear and non-linear weights. We observe the performance change of MLWCH as it varies. From Figure 7, when it is set to 0.3, 0.6, and 0.2 on MIRFLICKR-25K, NUS-WIDE, and MS COCO, respectively, our method obtains the best performance. Under this circumstance, the cross-modal hashing model attends to both linear and non-linear semantic similarity between instances. Undoubtedly, this balance is essential for comprehensively expressing complex cross-modal semantic relationships.
4.6.2. Trade-Off Parameter
As presented in Equation (20), this hyperparameter balances the two components of the multi-label weighted contrastive loss, i.e., the intra- and inter-modal weighted InfoNCE losses. To analyze their effect, we recorded the performance change while varying its value on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. Figure 8 reports that our method MLWCH achieves the best MAP score when this hyperparameter is set to 0.1. Although the intra-modal loss can constrain the intra-modality semantic structure, its performance gain is relatively minor for the cross-modal hashing task. In contrast, the inter-modal weighted InfoNCE loss has a relatively greater effect on hashing learning. We conjecture that this is mainly because the inter-modal contrastive strategy is more important for eliminating cross-modal heterogeneity. The above results also clarify that, by selecting an appropriate value for this hyperparameter, our model can achieve impressive performance.
4.6.3. Temperature Parameter
As mentioned in previous work [62], the temperature hyperparameter can obviously affect model performance in contrastive learning. To test and verify this viewpoint on the proposed multi-label weighted contrastive learning, we analyze the sensitivity of the temperature hyperparameter on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. Figure 9 illustrates its effect in MLWCH on these datasets, making it easy to see that the comprehensive retrieval accuracy varies significantly with the temperature, and MLWCH achieves the best MAP scores when it is set to 0.4, 0.46, and 0.26 on MIRFLICKR-25K, NUS-WIDE, and MS COCO, respectively. This supports the claim that, by selecting an appropriate temperature value, our method can achieve the best performance.
4.6.4. Trade-Off Parameter
As depicted in Equation (26), this hyperparameter serves as a trade-off factor between the multi-label weighted contrastive loss and the semantic similarity loss. From the observations in Figure 10, it is evident that, by setting it to 0.4, 0.2, and 0.1 on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets, respectively, our method attains the optimal performance. This result demonstrates the effectiveness of incorporating both the semantic similarity loss and the multi-label weighted contrastive learning loss. By carefully selecting an appropriate value for this hyperparameter, the proposed method can achieve superior performance.