Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing

Yang, Xiaohan; Wang, Zhen; Wu, Nannan; Li, Guokun; Feng, Chuang; Liu, Pingping

doi:10.3390/math10152644

Open AccessArticle

Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing

by

Xiaohan Yang

¹,

Zhen Wang

^1,2,*

,

Nannan Wu

¹,

Guokun Li

¹,

Chuang Feng

¹ and

Pingping Liu

²

¹

School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China

²

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

^*

Author to whom correspondence should be addressed.

Mathematics 2022, 10(15), 2644; https://doi.org/10.3390/math10152644

Submission received: 18 June 2022 / Revised: 23 July 2022 / Accepted: 25 July 2022 / Published: 28 July 2022

(This article belongs to the Special Issue Advances in Pattern Recognition and Image Analysis)

Download

Browse Figures

Versions Notes

Abstract

:

The image-text cross-modal retrieval task, which aims to retrieve the relevant image from text and vice versa, is now attracting widespread attention. To quickly respond to the large-scale task, we propose an Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing (DRNPH) to achieve cross-modal retrieval in the common Hamming space, which has the advantages of storage and efficiency. To fulfill the nearest neighbor search in the Hamming space, we demand to reconstruct both the original intra- and inter-modal neighbor matrix according to the binary feature vectors. Thus, we can compute the neighbor relationship among different modal samples directly based on the Hamming distances. Furthermore, the cross-modal pair-wise similarity preserving constraint requires the similar sample pair have an identical Hamming distance to the anchor. Therefore, the similar sample pairs own the same binary code, and they have minimal Hamming distances. Unfortunately, the pair-wise similarity preserving constraint may lead to an imbalanced code problem. Therefore, we propose the cross-modal triplet relative similarity preserving constraint, which demands the Hamming distances of similar pairs should be less than those of dissimilar pairs to distinguish the samples’ ranking orders in the retrieval results. Moreover, a large similarity marginal can boost the algorithm’s noise robustness. We conduct the cross-modal retrieval comparative experiments and ablation study on two public datasets, MIRFlickr and NUS-WIDE, respectively. The experimental results show that DRNPH outperforms the state-of-the-art approaches in various image-text retrieval scenarios, and all three proposed constraints are necessary and effective for boosting cross-modal retrieval performance.

Keywords:

cross-modal retrieval; image-text retrieval; cross-modal similarity preserving; hashing algorithm; unsupervised learning

MSC:

68T45

1. Introduction

Nowadays, people expect to comprehensively recognize and learn an object based on different modal information, such as image, text, video and audio. However, the traditional single-modal retrieval algorithms only return the nearest neighbors with the same modality as the query sample. To solve the above problem, more and more researchers are becoming interested in the cross-modal retrieval algorithm, which can achieve the tasks of retrieving different modal nearest neighbors, such as text retrieving image or image retrieving text [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].

According to the feature representation, we roughly divide the existing cross-modal retrieval algorithms into two categories, the real-value feature-vector-based cross-modal retrieval algorithms [1,3,4,5,6,12,14,26,27] and the hash-based cross-modal retrieval algorithms [2,7,8,9,10,11,15,16,17,18,19,20,21,22,23,24,25]. The real-value feature-vectors-based algorithms, such as the joint feature selection and subspace learning cross-modal retrieval algorithm [14], learn different modal sample real-value feature vectors in the common space and directly compute their similarity relationship based on the distances among different modal real-value feature vectors. The deep supervised cross-modal retrieval algorithm [1] learns the common space for both image and text and uses the weight sharing strategy to eliminate the differences between heterogeneous data. Kang et al. [27] utilize a unified framework to generate the image and text features, which can minimize the semantic inconsistency and the intra-class low-rank regularization value. As described above, most algorithms employ the real-value feature vectors to represent different modal content in the common space. However, the time complexity for computing the distances among real-value feature vectors is high, which cannot quickly respond to the large-scale cross-modal retrieval task.

To solve the above problem, the hash-based cross-modal retrieval algorithms map the real-value feature vectors into compact binary codes and retrieve the nearest neighbors according to the Hamming distance. It has the advantages of low storage space and high retrieval efficiency.

Generally, the hash-based cross-modal retrieval algorithm can be roughly divided into the supervised and unsupervised cross-modal hashing algorithms. The supervised cross-modal hashing algorithms [2,21,22,23,24,25,26] usually take the semantic label as supervision information and aim to preserve the original semantic similarity relationship among different modal samples in the Hamming space. Semantic preserving hashing [26] converts the semantic similarity preserving problem into the probability distribution consistency problem and learns the hash code by minimizing the Kullback–Leibler divergence. Zhan et al. [24] establish a semantic hierarchy to preserve the semantic similarity among intra-layer samples and the correlation among inter-layer samples. The above supervised cross-modal hashing algorithms can well preserve the semantic similarities during the generation of binary feature vectors. However, fewer training samples have semantic labels, and it costs too much to manually annotate semantic labels.

The unsupervised cross-modal hashing algorithm [8,9,10,14,15,16,17,18,28,29,30,31,32] does not involve semantic labels when learning binary feature vectors, which utilizes the distances between different modal feature representations to indicate their similarity relationship. Traditionally, unsupervised cross-modal hashing algorithms [29,30,31,32] compute the similarity relationship matrices based on the hand-crafted SIFT feature [33] and demand the original similarity relationship matrix between different modal samples in the common space to be preserved. Inter-media hashing [29] introduces both the inter- and intra-modal consistency constraints when mapping multimedia features into the common Hamming space. Latent semantic sparse hashing [31] captures the image structural information by sparse coding and utilizes the matrix factorization to learn the text latent semantic information. For the traditional cross-modal retrieval algorithms, it is difficult for the hand-crafted features to capture the abstract semantic relationships of the image [8], and the linear transformation can hardly model the complex nonlinear relations among multi-modal data [1].

As the deep neural network can generate excellent different modal features, it has been widely applied to unsupervised cross-modal hashing algorithms [15,16,17,18,19,20,28]. Wu et al. [28] used the convolutional neural network to jointly learn the real-value feature vectors and the corresponding binary code and employ the binary potential factor model to avoid the relaxation procedure of the binary encoding procedure. Hu et al. [9] propose a cross-modal hashing knowledge distillation framework, which establishes a similarity matrix between different modal samples in the teacher model and utilizes a lightweight network to separately generate image and text features. Zhang et al. [34] propose an end-to-end framework M2GUDA by stacking GCN and GAN models. M2GUDA uses the convolutional graph network to mine the deep semantic and structural information of both image and text.

The existing unsupervised cross-modal retrieval algorithms still have many problems: (1) many algorithms do not preserve both the intra- and inter-modal similarity relationship well, which may lead to an inferior cross-modal search performance [27,35]. (2) Usually, the cross-modal retrieval algorithms do not take relative similarity preservation into consideration [1,36,37], while the nearest neighbor search task emphasizes the ranking orders of retrieval samples. (3) The widely used pair-wise similarity preserving constraint demands that similar data pairs should have minimal distance values, which may assign the same code to most samples and lead to the imbalanced code problem [38,39,40]. (4) The encoding results have weak robustness, and the retrieval results are easily disturbed by the noise.

To solve the above problems, we propose an unsupervised deep relative neighbor relationship preserving cross-modal hashing (DRNPH), and the framework is shown in Figure 1. DRNPH uses the VGG19 network to learn the image feature and employs the bag-of-words model to generate the text feature. Then, we build a hash layer to generate different modal binary feature vectors. According to the generated floating-point features, we separately calculate the image neighbor matrix S^I, the text neighbor matrix S^T and the inter-modal neighbor matrix S^IT. Furthermore, we establish three objective functions, including the nearest neighbor similarity preserving function, the cross-modal pair-wise relative similarity preserving function and the cross-modal triple relative similarity preserving function. During the training process, we simultaneously minimize the above three loss functions to preserve different modal neighbor structures and relative similarity relationships in the common Hamming space. The advantages of our proposed method compared with other similar works as shown in Table 1.

The main contributions of this paper are as follows:

We demand the generated binary feature vectors should reconstruct both the original intra- and inter-modal neighbor structure, which guarantees the nearest neighbor retrieval results in the Hamming space are consistent with those in the original space.
The proposed intra- and inter-modal pair-wise similarity preserving functions, which guarantee the similar samples have the same Hamming distance value to the anchor, can assign the same binary code to similar samples.
For the triplet samples with different modalities, the cross-modal triple relative similarity preserving function demands that similar data pairs should have minimal distance values than those between dissimilar data pairs, which can avoid the imbalance code problem. Furthermore, it helps make the ranking orders of the retrieval results in the Hamming space identical to those in the original space.

The rest of this paper is organized as follows: Section 2 introduces the detail of the proposed DRNPH algorithm. Section 3 presents the comparative experimental results and analysis. Section 4 summarizes this paper.

2. The Proposed Method

2.1. Cross-Modal Retrieval Algorithm

The cross-modal retrieval algorithms can return the nearest neighbor with different modalities from the query sample. As shown in Figure 2, the cross-modal retrieval tasks include image retrieving text and text retrieving image.

In this paper, O = {o_i}ⁿ_i=₁ denotes the dataset containing n sample pairs and o_i = (I_i,T_i) represents the image–text pair. F_i^I is the image’s real-value feature vectors. Correspondingly, F_i^T is the real-value feature vectors of the text. We map F_i^I and F_i^T to the compact binary feature vectors B^I and B^T, respectively. Then, we quickly respond to the large-scale cross-modal retrieval task according to the Hamming distance.

Thus, to achieve a better cross-modal retrieval performance, the generated binary feature vectors should reconstruct the original intra- and inter-modal neighbor structure in the Hamming space. Furthermore, we define the pair-wise and triplet similarity preserving functions to boost the cross-modal retrieval performance.

2.2. Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing

In this section, we establish deep neural networks to learn different modal real-value features and map them into compact binary codes. During the training process, to guarantee the cross-modal retrieval performance obtained in the common Hamming space, we define three loss functions, including the neighbor structure-preserving function, cross-modal pair-wise similarity preserving function and cross-modal triple relative similarity preserving function.

2.2.1. Deep Learning Binary Feature Vectors

In this paper, we separately establish two deep neural networks to learn the image and text binary feature vectors.

For image modality, we reconstruct the VGG-19 network and use the 4096-dimensional vector output from the f_C₇ layer as the real-value feature vectors F_i^I. The VGG-19 network outputs 4096 dimensional vectors by default in the case of feature extraction. In addition, many existing works [44,45,46,47] have proved that the default setting owns the best performance. Thus, our proposed method sets the dimension of the feature vector to 4096. Then, we utilize a hash layer instead of the classifier f_C₈ layer, which maps the image feature F_i^I into K-dimensional vectors H_i^I, as shown in Equation (1). θ^I represents the ImgNet parameters. The parameters of ImgNet include the parameters in the VGG-19 network and three fully connected layers, which are the weights of the convolution core and the biases of all channels.

H_{i}^{I} = {ImgNet (I}_{i}, θ^{I})

(1)

Then, we use the sign function to obtain the image binary feature vectors B_i^I, as in Equation (2).

\begin{array}{l} B_{i}^{I} = s i g n (H_{i}^{I}) \\ B^{I} \in {- 1, + 1}^{n \times K} \end{array}

(2)

For the text modality, we employ the bag of words model and the multi-layer perception to learn the 4096-dimensional text feature F_i^T. Similarly, we establish a hash layer to map the text feature F_i^T to the K-dimensional vector H_i^T, as shown in Equation (3). θ^T represents the TxtNet parameters. The parameters of TxtNet include the parameters of three fully connected network layers, which are the weights and bias of all neural cells. During the training process, we employ the gradient descent mechanism to update the parameters in both ImgNet and TxtNet.

H_{i}^{T} = {TxtNet (T}_{i}, θ^{T})

(3)

Then, we use the sign function to obtain the text binary feature vectors B_i^T, as shown in Equation (4).

\begin{array}{l} B_{i}^{T} = s i g n (H_{i}^{T}) \\ B^{T} \in {- 1, + 1}^{n \times K} \end{array}

(4)

2.2.2. Neighbor Structure Preserving

Generally, to guarantee the nearest neighbor retrieval performance in the Hamming space, we require the generated binary feature vectors to reconstruct the original neighbor structure in the Hamming space [8]. For the cross-modal retrieval task, we separately establish the intra- and inter-modal neighbor matrices according to the real-value features of the image and text.

There are two kinds of intra-modal neighbor matrices, including the image neighbor matrix S^I and the text neighbor matrix S^T.

S_ij^I∈[−1,+1] is the (i, j) element in the image neighbor matrix S^I, and its value represents the cosine similarity between images i and j, as defined in Equation (5). These matrices are symmetrical across the main diagonal. For example, S^I(m, n) = S^I(n, m). However, they have different meanings. S^I(m, n) represents the similarity degree between the query sample m and the database sample n. In contrast, n is the query sample and m is the database sample in S^I(n, m).

S_{i j}^{I} = \cos ({\tilde{F}}_{i}^{I}, {\tilde{F}}_{j}^{I}) = \frac{{\tilde{F}}_{i}^{I} ({\tilde{F}}_{j}^{I})^{'}}{‖ {\tilde{F}}_{i}^{I} ‖ ‖ {\tilde{F}}_{j}^{I} ‖} \in [- 1, + 1]

(5)

In Equation (5), (·)′ is the transpose of the matrix.

\tilde{F}

_i^I is the normalized value of the image’s real-value feature vectors F_i^I, which makes different modal features have the same value range. The formulation for normalizing the feature vector F_i^I is shown in Equation (6), where p is the exponent value in the norm formulation, whose default value is 2. ε is a small value to avoid division by zero, whose default value is 1 × 10⁻¹². This measure can also accelerate the convergence speed of the gradient descent optimization procedure.

\tilde{F_{i}^{I}} = \frac{F_{i}^{I}}{\max ({‖ F_{i}^{I} ‖}_{p}, ε)}

(6)

Similarly, S_ij^T∈[−1,+1] is the (i, j) element in the text neighbor matrix S^T, and its value denotes the cosine similarity between texts i and j, as defined in Equation (7).

\tilde{F}

_i^T is the normalized value of the text’s real-value feature vectors F_i^T.

S_{i j}^{T} = \cos ({\tilde{F}}_{i}^{T}, {\tilde{F}}_{j}^{T}) = \frac{{\tilde{F}}_{i}^{T} ({\tilde{F}}_{j}^{T})^{'}}{‖ {\tilde{F}}_{i}^{T} ‖ ‖ {\tilde{F}}_{j}^{T} ‖} \in [- 1, + 1]

(7)

Based on the obtained S^I and S^T, we further define the inter-modal neighbor matrix S^IT as in Equation (8), which denotes the neighbor relationship between different modal samples. The parameters α, β and μ indicate the weight value of each modal neighbor information.

S^{I T} = α S^{I} + β S^{T} + μ \cos (S^{I}, S^{T}), s . t . α + β + μ = 1, α, β, μ \geq 0, S^{I T} \in [- 1, + 1]

(8)

In this paper, we define the inter-modal neighbor structure-preserving function as in Equation (9), which demands that the neighbor relationship computed based on the binary feature vectors should be consistent with the original inter-modality neighbor relationship.

L_{1} = \min_{B_{I}, B_{T}} η_{1} {‖ γ_{1} S^{I T} - \cos (B_{I}, B_{I}) ‖}_{F}^{2} + {‖ γ_{1} S^{I T} - \cos (B_{I}, B_{T}) ‖}_{F}^{2} + η_{2} {‖ γ_{1} S^{I T} - \cos (B_{T}, B_{T}) ‖}_{F}^{2}

(9)

In Equation (9), the parameters η₁ and η₂ balance the reconstruction loss between the inter- and intra-modal neighbor matrices. The parameter

γ_{1}

can make the values in the Hamming neighbor matrix have the same scale as those in S^IT. We calculate cos(B_I, B_I) as defined in Equations (10) and (11) and, thus, cos(B_I, B_T), cos(B_T, B_T) as well.

c \tilde{o} s (B_{I_{i}}, B_{I_{j}}) = \frac{B_{I_{i}} (B_{I_{j}})^{'}}{‖ B_{I_{i}} ‖ ‖ B_{I_{j}} ‖}

(10)

\cos (B_{I}, B_{I}) = [\begin{matrix} c \tilde{o} s (B_{I_{1}}, B_{I_{1}}) & c \tilde{o} s (B_{I_{1}}, B_{I_{2}}) & \dots & c \tilde{o} s (B_{I_{1}}, B_{I_{n}}) \\ c \tilde{o} s (B_{I_{2}}, B_{I_{1}}) & c \tilde{o} s (B_{I_{2}}, B_{I_{2}}) & \dots & c \tilde{o} s (B_{I_{2}}, B_{I_{n}}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ c \tilde{o} s (B_{I_{n}}, B_{I_{1}}) & c \tilde{o} s (B_{I_{n}}, B_{I_{2}}) & \dots & c \tilde{o} s (B_{I_{n}}, B_{I_{n}}) \end{matrix}]

(11)

Similarly, we define the intra-modal neighbor structure-preserving function as in Equation (12). It requires that both the image and text modal neighbor structures obtained in the Hamming space should be consistent with those in the Euclidean space.

L_{2} = \min_{B_{I}, B_{T}} {‖ γ_{2} S^{I} - \cos (B_{I}, B_{I}) ‖}_{F}^{2} + {‖ γ_{2} S^{T} - \cos (B_{T}, B_{T}) ‖}_{F}^{2}

(12)

In Equation (12), the parameter

γ_{2}

makes the values in the Hamming neighbor matrix have the same scale as those in S^I and S^T.

During the training process, we jointly minimize the value of Equations (9) and (12) to preserve both the inter- and intra-modal neighbor structure in the Hamming space.

For different data sets, the neighbor information of each modality is diverse. However, the fixed weight parameters in Equation (9) cannot self-adapt to different datasets, which makes the binary feature vectors learning process inflexible [32]. Fortunately, the intra-modal nearest neighbor structure matrix can alleviate the problem caused by Equation (12) by fine-tuning the generated binary feature vectors.

2.2.3. Cross-Modal Pair-Wise Similarity Preserving

For the cross-modal retrieval task, the ideal situation is different modal samples belonging to the same category and having identical binary code. To achieve this goal, this paper demands that sample pairs of the same category should have identical Hamming distances to the same anchor. According to the anchor and samples’ modalities, we separately design the intra- and inter-modal pair-wise similarity preserving function in this paper. Figure 3 shows the pair-wise similarity preserving constraint. The sample located at the arrow tail is the anchor. Correspondingly, the samples located at the arrow-heads should have the same Hamming distance to the anchor.

For the sample pairs o_i = (I_i,T_i) and o_j = (I_j,T_j), the intra-modal pair-wise similarity preserving constraint requires that the Hamming distance between the images I_i and I_j should be the same as that between the texts T_i and T_j, dis_H(B_i^I, B_j^I) = dis_H(B_i^T, B_j^T).

In Figure 3, the black arrow ① represents the intra-modal pair-wise similarity preserving constraint, which demands the Hamming distance between the car and ship images should be equal to that between the car and ship texts.

The inter-modality pair-wise similarity preserving constraint demands that the same category sample pairs that belong to different modalities should have the same Hamming distance, dis_H(B_i^I,B_j^T) = dis_H(B_i^T,B_j^I).

In Figure 3, the blue arrow ② represents the inter-modal pair-wise similarity preserving constraint, which demands the Hamming distance between the car image and the ship text should be equal to that between the car text and the ship image.

In addition, for the training pair o_i = (I_i,T_i), the image I_i and text T_i belong to the same category, and they should have the same binary feature vectors. Therefore, we minimize the Hamming distances of all training pairs, as defined in Equation (13), where E is an n × n identity matrix. Tr(

\cdot

) is the matrix trace.

L_{p a i r} = \min_{B^{I}, B^{T}} {‖ T r (\cos (B^{I}, B^{T}) - E) ‖}^{2}

(13)

To further enhance the pair-wise similarity preserving performance, we demand that the intra-modal similarity relationship be as the same as the inter-modal similarity relationship, dis_H(B_i^I,B_j^T) = dis_H(B_i^I,B_j^I), dis_H(B_i^I,B_j^T) = dis_H(B_i^T,B_j^T).

In Figure 3, the yellow arrow ③ and red arrow ④ show the above similarity preserving constraints. As a result, the Hamming distance between the car image and the ship text equals that between the car and ship images.

Finally, we define the cross-modal pair-wise similarity preserving function as in Equation (14).

\begin{array}{l} L_{p} & = \min_{B^{I}, B^{T}} {‖ \cos (B^{I}, B^{T}) - \cos (B^{I}, B^{I}) ‖}^{2} + {‖ \cos (B^{I}, B^{T}) - \cos (B^{T}, B^{T}) ‖}^{2} \\ + {‖ \cos (B^{I}, B^{I}) - \cos (B^{T}, B^{T}) ‖}^{2} + {‖ \cos (B^{I}, B^{T}) - {(\cos (B^{I}, B^{T}))}^{T} ‖}^{2} + {‖ T r (\cos (B^{I}, B^{T})) - E ‖}^{2} \end{array}

(14)

2.2.4. Cross-Modal Triple Relative Similarity Preserving

The pair-wise similarity preserving function would lead to an imbalanced code, and all samples may have the same binary code. To fix the above problem, we propose the cross-modal triple relative similarity preserving function to avoid all sample pairs having the same Hamming distance.

We define the cross-modal triple relative similarity preserving constraints in Equations (15) and (16). This demands that the Hamming distance between the similar pair should be less than that between the dissimilar pair, and the sample pairs should belong to different modalities.

L_{t r i p l e t_I} = \sum_{i, j, k} \max ({‖ B_{i}^{I} - B_{j^{+}}^{T} ‖}_{2}^{2} - {‖ B_{i}^{I} - B_{k^{-}}^{T} ‖}_{2}^{2} + ϕ, 0)

(15)

L_{t r i p l e t_T} = \sum_{i, j, k} \max ({‖ B_{i}^{T} - B_{j^{+}}^{I} ‖}_{2}^{2} - {‖ B_{i}^{T} - B_{k^{-}}^{I} ‖}_{2}^{2} + ϕ, 0)

(16)

In Equations (15) and (16), φ is the similarity threshold. B^I_i⁺ is the binary feature vectors of the image I_i. B^T_j⁺ is the binary feature vectors of the text sample, which is similar to image I_i. B^T_k⁻ is the binary feature vectors of the text that are dissimilar to the image I_k. B_Ti is the binary feature vectors of the text T_i. B^I_j⁺ is the binary feature vectors of the image I_j that are similar to the text T_i. B^I_k⁻ is the binary feature vectors of the image I_k, which are dissimilar to the text T_i.

When φ is larger than 0, the cross-modal triple relative similarity preserving constraint can make the dissimilar and similar samples have different binary feature vectors. In addition, this constraint focuses on distinguishing the relative similarity relationship between different modal samples. Thus, it can guarantee the ranking orders of the cross-modal retrieval results in the Hamming space are consistent with those in the original space. Furthermore, the cross-modal triplet relative similarity preserving constraint boosts the proposed algorithm’s robustness to noise. For example, we set the similarity threshold as 2, and the query sample’s binary feature vector is 000. When the similar sample’s code is 001, we should assign the dissimilar sample as 111 to satisfy the cross-modal triplet similarity preserving constraints. As a result, even though the noise changes the similar sample’s code to 010, 100, 101 and 110, we can still correctly retrieve the similar sample.

Finally, the objective function of the proposed Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing algorithm is defined as in Equation (17):

L = L_{1} + ω_{1} L_{2} + L_{pair} + ω_{2} L_{a l i g n} + ω_{3} (L_{t r i p l e t_I} + L_{t r i p l e t_T})

(17)

where ω₁, ω_2, ω₃ are the weight values.

2.2.5. Optimization

In this paper, to minimize the value of the loss function defined in Equation (17), we optimize the parameters of the ImgNet and TxtNet by the stochastic gradient descent mechanism. However, the Hamming distances are computed based on the discrete binary feature vectors, which are generated by the sign function. Unfortunately, it cannot backward propagate the gradient value of the loss function and causes the gradient disappearance problem. As a result, we cannot iteratively optimize the deep neural network’s parameters.

To solve the above problem, we use the continuous tanh (

\cdot

) function [8,11] instead of the sign function, as in Equation (18):

B = s g n (H) = \lim_{σ \to \infty} \tanh (σ H)

(18)

During the training procedure, we gradually increase the value of σ. When σ→

\infty

, the tanh(

\cdot

) function converges to the sign function, and the quantization error between the binary feature vectors and its relaxed values is minimized.

The training procedure of the proposed Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing is shown in Algorithm 1.

Algorithm 1 Learning of our DRNPH

Input:
Training set O = {I_i,T_i}ⁿ_i=₁; Batch size m; Max training epoch E; Hash code length k; ImgNet and TxtNet with the parameters

θ_{I}

,

θ_{T}

; The parameters:

α, β, μ, γ_{1}, γ_{2}, φ, η_{1}, η_{2}

.
Output:
Hashing functions ImgNet(I_i,

θ_{I}

) and TxtNet(T_i,

θ_{T}

);
1. Initialize epoch h = 0;
2. Repeat
3. h = h + 1;

σ = \sqrt{h}

4. For i = 1:

⌊ n / E ⌋

do
5. Randomly select a small training batch from O_E = {I_i, T_i}^E_i=₁.
6. Generate the image feature F_i^I using the image network.
7. Generate the text feature F_i^T using the text network.
8. Calculate S^I, S^T, S^IT by Equations (5), (7) and (8), respectively.
9. Compute the binary feature vectors B = tanh(

σ

H).
10.          Compute the loss value L by Equation (17).
11.          Optimize the network parameters by stochastic gradient descent mechanism.
12.      end for
13. until convergence

3. Experiments

3.1. Datasets

We conducted the cross-modal nearest neighbor retrieval comparative experiments on two commonly used public datasets, MIRFlickr and NUS-WIDE.

MIRFlickr [48] contains 25,000 image-text pairs with a total of 24 classes. We randomly selected 2000 image-text pairs as the query samples. Then, the remaining 23,000 image-text pairs were utilized for validation and 5000 pairs were employed for the training network.

NUS-WIDE [49] consists of 186,577 image-text pairs in total. We randomly selected 2000 image-text pairs for querying, 184,577 pairs for validation, and 5000 pairs for training.

3.2. Experimental Setting

In this paper, we employed the VGG-19 pre-trained on the ImageNet dataset [50] as the ImgNet backbone and utilized the tanh(∙) function to generate binary codes. Correspondingly, TxtNet utilizes the bag of words model and the multi-layer perceptron (MLP) as the backbone, and the MLP utilizes the ReLU as the activation function.

We use the mini-batch SGD to optimize the network parameters. The training batch size is 32. We set the momentum as 0.9 and the weight decay as 0.0005. The initial learning rates of ImgNet and TextNet are 0.001 and 0.01, respectively. We implement the proposed algorithm by the PyTorch V1.11.0 deep learning framework on an Ubuntu server (CPU Intel Xeon Gold 6242R 3.10 GHz, 64 GB RAM, GPU Tesla T4 16 GB).

3.3. Evaluation Metrics

We evaluate the cross-modal nearest neighbor retrieval performance using the mean Average Precision (mAP) and the precision@top-N curve [8,11,13].

The definition of mAP is shown in Equation (19), which represents the mean value of the average precision of the nearest neighbor search results.

mAP = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{l} \sum_{m = 1}^{R} P (m) \times r e l_{m}

(19)

In Equation (19), n is the query set size. R is the number of retrieved samples, and its value is 50. l is the number of positive samples. P(m) is the accuracy of retrieving m samples. When the m-th retrieved sample is positive, rel_m = 1, otherwise, rel_m = 0.

The Top-N precision curve represents the precision value of retrieving a different number of samples.

3.4. Experimental Results and Analysis

We compare the performance of the cross-modal nearest neighbor retrieval tasks, including image retrieving text and text retrieving image. The comparison algorithms are as follows:

IMH [29] learns hash functions by a linear regression model, which can map different modal data into the common Hamming space.

LCMH [30] proposes a scalable index retrieval approach, which has a linear time complexity.

CMFH [14] learns the uniform binary feature vectors for different modalities by the collective matrix factorization of the latent factor models.

LSSH [31] uses sparse matrices and matrix factorization to retrieve cross-modal nearest neighbors.

RFDH [32] proposes the robust and flexible discrete hashing, which assigns a larger weight value to the more important modality.

DBRC [51] proposes a deep binary reconstruction model, which aims to preserve the inter-modal correlation.

UDCMH [28] jointly uses the deep learning framework, the matrix factorization and the binary correlation factor model to achieve the cross-modal retrieval task.

DJSRH [8] proposes a joint semantic affinity matrix, which is well suited for batch training, to reconstruct hash codes.

JDSH [11] proposes a sampling and weighting mechanism, which computes the similarity decision weight values based on the sample distribution.

PSTH [7] employs the pair-wise similarity constraint to preserve the original neighborhood relationship in the Hamming space.

The mAP@50 values of the cross-modal retrieval performance on MIRFlickr and NUS-WIDE datasets are shown in Table 2 and Table 3. The greater mAP score represents a better retrieval performance. Table 2 shows the performance of the image retrieving text task. Table 3 shows the performance of the text retrieving image task. I→T represents the task of image retrieving text and T→I represents the task of text retrieving image.

On the MIRFlickr dataset, our method significantly outperforms the other 10 methods. Compared with the state-of-the-art PSTH method, DRNPH improves the mAP@50 value of the image retrieving text performance by 3%, 3.4%, and 3.8% with 32-, 64-, and 128-bit binary code, respectively.

For the NUS-WIDE dataset, 65% of the text information is noise [49], which results in the inferior cross-modal performance of many algorithms. However, the proposed method DRNPH still achieves a better performance as it improves the noise robustness by using a triplet similarity preserving constraint.

Figure 4 and Figure 5 show the top-N precision curves on the MIRFlickr and NUS-WIDE datasets, respectively. The length of the binary code is 128. The precision of the proposed DRNPH is better than the other 10 methods on both datasets.

IMH and LCMH learn hash functions by eigenvalue decomposition, which may lead to a worse cross-modal retrieval performance as the binary code length increases [32]. CMHF utilizes the collective matrix factorization to learn different modal unified hash codes. To improve the performance of CMHF, LSSH proposes to jointly utilize the sparse matrices and matrix decomposition to learn hash codes. However, the relaxation mechanisms in LSSH usually result in large quantization errors during the binarization process. In contrast, unsupervised RFDH directly optimizes the discrete binary codes, which can avoid the quantization error caused by the relaxation mechanism. However, RFDH does not preserve the original intra- and inter-modal neighbor structure in the Hamming space, and the same problem also exists in DBRC. DJSRH utilizes a joint semantic affinity matrix to reconstruct the deep hash codes. However, the redundant information in the joint semantic affinity matrix may lead to inferior performance of DJSRH. In contrast, JDSH can effectively reduce redundant information. Unfortunately, JDSH artificially sets the weight values for different modalities, which cannot self-adapt to the data distribution. In this paper, we conducted unsupervised deep learning of different modal binary feature vectors in the common space. To guarantee the cross-modal performance, we demand that the generated binary feature vectors simultaneously preserve the neighbor structure matrix, the cross-modal pair-wise similarity relationship and the cross-modal triplet relative similarity relationship. Thus, the proposed DRNPH method can make different modal nearest neighbors have similar binary feature vectors and guarantee that the cross-modal ranking orders in the Hamming space are consistent with those in the Euclidean space. Furthermore, DRNPH has improved the noise robustness. As a result, DRNPH can achieve the best cross-modal performance.

3.5. Ablation Experiments

To obtain an excellent cross-modal retrieval performance, we design three constraints, including the neighbor structure-preserving function, the cross-modal pair-wise similarity preserving function, and the cross-modal triple relative similarity preserving function. To demonstrate the effectiveness of the above three functions at improving the cross-modal nearest neighbor retrieval performance, we conducted the ablation experiments and designed four comparison algorithms as follows.

DRNPH-1 algorithm: only employs the inter-modal neighbor structure-preserving functions defined in Equations (8) and (9).

DRNPH-2 algorithm: improves DRNPH-1 by adding the intra-modal neighbor structure-preserving function defined in Equation (12).

DRNPH-3 algorithm: improves DRNPH-2 by employing Equation (13), which minimizes the Hamming distance between each training pair with different modalities.

DRNPH-4 algorithm: improves DRNPH-3 by using the cross-modal pair-wise similarity preserving function defined in Equation (14).

We compare the cross-modal retrieval performances of the proposed DRNPH and the above four algorithms on MIRFlickr and NUS-WIDE datasets, respectively. The mean Average Precision values are shown in Table 4 and Table 5. The number of the binary bits is 128.

The ablation experimental results show that the mean Average Precision value of the cross-modal retrieval performance gradually increases as employing more constraint functions. This demonstrates the effectiveness and necessity of each constraint function. DRNPH-2 achieves a slightly better performance than DRNPH-1, as it can preserve the intra-modal neighbor structure. DRNPH-3 further improves the cross-modal retrieval performance by preserving the inter-modal neighbor structure, which satisfies the need for the cross-modal retrieval task. The pair-wise similarity preserving function makes similar samples have the same binary code and helps DRNPH-4 improve the mAP value of image retrieving text and text retrieving image by 1.5% and 1.6% on the MIRFlickr dataset, respectively. DRNPH achieves the best performance, as the cross-modal triplet similarity preserving function can avoid the imbalance code in DRNPH-4 and enhance the relative similarity preserving performance.

3.6. The Parameter Setting Experiments

In this paper, we set the comparative experiments to choose the weight values in Equations (8), (9), (12) and (17). Figure 6 shows the MAP@50 value of both image retrieving text and text retrieving image tasks with different weight values on the NUS-WIDE dataset. The parameters α, β and μ in Equation (8) indicate the weight value of each modal neighbor information. In Figure 6a, we achieve the cross-modal retrieval performance with different value combinations, and we obtain the best cross-modal retrieval results when α = 0.4, β = 0.2, μ = 0.4.

In Equation (9), the parameters η₁ and η₂ balance the reconstruction loss between the inter- and intra-modal neighbor matrices, and parameter γ₁ makes the values in the Hamming neighbor matrix have the same scale as those in S_IT. Figure 6b,c shows that we can obtain the best cross-modal retrieval results by setting η₁ = 0.1, η₂ = 0.1 and γ₁ = 1.5.

In Equation (12), the parameter γ₂ makes the values in the Hamming neighbor matrix have the same scale as those in S_I and S_T. When γ₂ = 1.5, we can obtain the best cross-modal retrieval results, as shown in Figure 6d.

In Equation (17), the parameter ω₁, ω₂, ω₃ balances the weight value among different loss functions. The best cross-modal retrieval performance appears when ω₁ = 0.1, ω₂ = 0.8, ω₃ = 0.6, as shown separately in Figure 6e–g.

4. Conclusions

In this paper, we propose a novel Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing (DRNPH), which can effectively improve the cross-modal retrieval performance in the Hamming space. We utilize the pre-trained VGG19 to generate real-value image feature vectors and employ the bag-of-words and multi-layer perception to learn the real-value text feature vectors. Then, we map different modal features into the common Hamming space and achieve the cross-modal nearest neighbor search task based on the Hamming distance. In this paper, to obtain an excellent cross-modal retrieval performance in the Hamming space, we design three loss functions. Firstly, we establish the intra- and inter-modal neighbor matrix preserving functions to reconstruct the neighbor matrix in the Hamming space, which can well preserve the original neighbor similarity relationship. Secondly, the cross-modal pair-wise similarity preserving function demands that similar samples should have the same Hamming distance to the same anchor. Thus, similar samples possess the same binary feature vectors, and they have minimal Hamming distances. However, the pair-wise similarity preserving constraint may lead to an imbalanced code problem, and all samples have the same binary code. To fix this problem, we propose the cross-modal triplet relative similarity preserving constraint, which demands that the Hamming distances among similar samples should be less than those among the dissimilar samples. The triplet relative similarity preserving constraint helps to distinguish the ranking orders. Furthermore, it can boost the noise robustness by setting a large similarity marginal. We conduct the cross-modal retrieval comparative experiments and the ablation study on both MIRFlickr and NUS-WIDE datasets. The ablation experiments verify that all three constraints are necessary for improving the cross-modal retrieval performance. The final comparative experimental results show that the proposed DRNPH method outperforms the other ten state-of-the-art methods. The proposed deep learning architecture contains a large number of parameters, which causes a higher training time complexity. In the future, we will employ the knowledge distillation mechanism to generate a lightweight network.

Author Contributions

Conceptualization, Z.W. and X.Y.; methodology, Z.W. and X.Y.; software, G.L.; validation, N.W., C.F. and P.L.; formal analysis, Z.W. and X.Y.; investigation, Z.W. and P.L. resources, G.L.; data curation, C.F.; writing—original draft preparation, X.Y.; writing—review and editing, Z.W.; visualization, N.W. and X.Y.; supervision, Z.W.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61841602, the Natural Science Foundation of Shandong Province of China, grant number ZR2018PF005 and ZR2021MF017, the Youth Innovation Science and Technology Team Foundation of Shandong Higher School, grant number 2021KJ031 and the Fundamental Research Funds for the Central Universities, JLU, grant number 93K172021K12.

Acknowledgments

The authors express their gratitude to the institutions that supported this research: Shandong University of Technology (SDUT) and Jilin University (JLU).

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhen, L.; Hu, P.; Wang, X.; Peng, D. Deep Supervised Cross-Modal Retrieval. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
Yu, J.; Wu, X.-J.; Kittler, J. Discriminative Supervised Hashing for Cross-Modal Similarity Search. Image Vis. Comput. 2019, 89, 50–56. [Google Scholar] [CrossRef] [Green Version]
Li, D.; Dimitrova, N.; Li, M.; Sethi, I.K. Multimedia content processing through cross-modal association. In Proceedings of the International Conference on Multimedia, Berkeley, CA, USA, 2–8 November 2003. [Google Scholar]
Rasiwasia, N.; Pereira, J.C.; Coviello, E.; Doyle, G.; Lanckriet, G.R.G.; Levy, R.; Vasconcelos, N. A new approach to cross-modal multimedia retrieval. In Proceedings of the International Conference on Multimedia, ACM, Florence, Italy, 25–29 October 2010. [Google Scholar]
Kan, M.; Shan, S.; Zhang, H.; Lao, S.; Chen, X. Multi-view discriminant analysis. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
Wang, K.; He, R.; Wang, W.; Wang, L.; Tan, T. Learning coupled feature spaces for Cross-modal matching. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013. [Google Scholar]
Kang, P.; Lin, Z.; Yang, Z. Pairwise similarity transferring hash for unsupervised cross-modal retrieval. Comput. Appl. Res. 2021, 38, 3025–3029. [Google Scholar]
Su, S.; Zhong, Z.; Zhang, C. Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
Hu, H.; Xie, L.; Hong, R.; Tian, Q. Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Yu, J.; Zhou, H.; Zhan, Y.; Tao, D. Deep Graph-neighbor Coherence Preserving Network for Unsupervised Cross-modal Hashing. In Proceedings of the Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021. [Google Scholar]
Liu, S.; Qian, S.; Guan, Y.; Zhan, J.; Ying, L. Joint-modal Distribution-based Similarity Hashing for Large-scale Unsupervised Deep Cross-modal Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020. [Google Scholar]
Wang, K.; He, R.; Wang, L.; Wang, W.; Tan, T. Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2010–2023. [Google Scholar] [CrossRef] [PubMed]
Li, C.; Deng, C.; Wang, L.; Xie, D.; Liu, X. Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval. In Proceedings of the Thirty-First Innovative Applications of Artificial Intelligence Conference, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
Ding, G.; Guo, Y.; Zhou, J. Collective matrix factorization hashing for multimodal data. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
Chen, D.; Cheng, M.; Min, C.; Jing, L. Unsupervised Deep Imputed Hashing for Partial Cross-modal Retrieval. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020. [Google Scholar]
Zhang, J.; Peng, Y. Multi-Pathway Generative Adversarial Hashing for Unsupervised Cross-Modal Retrieval. IEEE Trans. Multimed. 2020, 22, 174–187. [Google Scholar] [CrossRef]
Tuan, H.; Do, T.-T.; Nguyen, T.V.; Cheung, N.-M. Unsupervised Deep Cross-modality Spectral Hashing. IEEE Trans. Image Process. 2020, 29, 8391–8406. [Google Scholar]
Shen, X.; Zhang, H.; Li, L.; Liu, L. Attention-Guided Semantic Hashing for Unsupervised Cross-Modal Retrieval. In Proceedings of the International Conference on Multimedia and Expo, Shenzhen, China, 5–9 July 2021. [Google Scholar]
Wang, C.; Yang, H.; Meinel, C. Deep semantic mapping for cross modal retrieval. In Proceedings of the 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), Vietri sul Mare, Italy, 9–11 November 2015. [Google Scholar]
Castrejon, L.; Aytar, Y.; Vondrick, C.; Pirsiavash, H.; Torralba, A. Learning aligned cross modal representations from weakly aligned data. In Proceedings of the Computer Vision and Pattern Recognition, LasVegas, NV, USA, 27–30 June 2016; pp. 2940–2949. [Google Scholar]
Lin, Z.; Ding, G.; Han, J.; Wang, J. Cross view retrieval via probability-based semantics preserving hashing. IEEE Trans. Cybern. 2017, 47, 4342–4355. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Shen, H.T.; Liu, L.; Yang, Y.; Xu, X.; Huang, Z.; Shen, F.; Hong, R. Exploiting Subspace Relation in Semantic Labels for Cross-Modal Hashing. IEEE Trans. Knowl. Data Eng. 2021, 33, 3351–3365. [Google Scholar] [CrossRef]
Wang, L.; Zareapoor, M.; Yang, J.; Zheng, Z. Asymmetric Correlation Quantization Hashing for Cross-Modal Retrieval. IEEE Trans. Multimed. 2020. [Google Scholar] [CrossRef]
Zhan, Y.; Luo, X.; Wang, Y.; Xu, X.-S. Supervised Hierarchical Deep Hashing for Cross-Modal Retrieval. In Proceedings of the MM ’20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
Wang, J.; Li, G.; Pan, P.; Zhao, X. Semi-supervised semantic factorization hashing for fast cross-modal Retrieval. Multimed. Tools Appl. 2017, 76, 20197–20215. [Google Scholar] [CrossRef]
Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
Kang, P.; Lin, Z.; Yang, Z.; Fang, X.; Bronstein, A.M.; Li, Q.; Liu, W. Intra-class low-rank regularization for supervised and semi-supervised cross-modal retrieval. Appl. Intell. 2022, 52, 33–54. [Google Scholar] [CrossRef]
Wu, G.; Lin, Z.; Han, J.; Liu, L.; Ding, G.; Zhang, B.; Shen, J. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval. In Proceedings of the Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018. [Google Scholar]
Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; Shen, H.T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013. [Google Scholar]
Zhu, X.; Huang, Z.; Shen, H.; Zhao, X. Linear cross-modal hashing for efficient multimedia search. In Proceedings of the ACM Multimedia Conference, Barcelona, Spain, 21–25 October 2013. [Google Scholar]
Zhou, J. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the 37th international ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, QLD, Australia, 6–11 July 2014. [Google Scholar]
Wang, D.; Wang, Q.; Gao, X. Robust and Flexible Discrete Hashing for Cross-Modal Similarity Search. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2703–2715. [Google Scholar] [CrossRef]
David, G.L. Object Recognition from Local Scale-Invariant Features. In Proceedings of the International Conference on Computer Vision, Kerkyra, Greece, 20–25 September 1999. [Google Scholar]
Zhang, C.; Zhong, Z.; Zhu, L.; Zhang, S.; Cao, D.; Zhang, J. M2GUDA: Multi-Metrics Graph-Based Unsupervised Domain Adaptation for Cross-Modal Hashing. In Proceedings of the International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021. [Google Scholar]
Qiang, H.; Wan, Y.; Xiang, L.; Meng, X. Deep semantic similarity adversarial hashing for cross-modal retrieval. Neurocomputing 2020, 400, 24–33. [Google Scholar] [CrossRef]
Jin, L.; Li, Z.; Tang, J. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE Trans. Neural Netw. Learn. Syst. 2020, 1–14. [Google Scholar] [CrossRef] [PubMed]
Jiang, Q.; Li, W. Deep Cross-Modal Hashing. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Wang, Z.; Wu, N.; Yang, X.; Yan, B.; Liu, P. Deep Learning Triplet Ordinal Relation Preserving Binary Code for Remote Sensing Image Retrieval Task. Remote Sens. 2021, 13, 4786. [Google Scholar] [CrossRef]
Wang, Z.; Sun, F.; Zhang, L.; Liu, P. Minimal Residual Ordinal Loss Hashing with an Adaptive Optimization Mechanism. EURASIP J. Image Video Process. 2020, 2020, 10. [Google Scholar] [CrossRef]
Liu, H.; Ji, R.; Wu, Y.; Huang, F. Ordinal Constrained Binary Code Learning for Nearest Neighbor Search. In Proceedings of the Thirty-First Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
Zhang, J.; Peng, Y.; Yuan, M. Unsupervised Generative Adversarial Cross-modal Hashing. In Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
Li, C.; Deng, C.; Li, N.; Liu, W.; Gao, X.; Tao, D. Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
Zhan, Y.; Yu, J.; Yu, Z.; Zhang, R.; Tao, D.; Tian, Q. Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval. In Proceedings of the MM ’18: ACM Multimedia Conference, Seoul, Korea, 22–26 October 2018. [Google Scholar]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Siddan, G.; Palraj, P. Foetal neurodegenerative disease classification using improved deep ResNet classification based VGG-19 feature extraction network. Multimed. Tools Appl. 2022, 81, 2393–2408. [Google Scholar] [CrossRef]
Mu, Y.; Ni, R.; Zhang, C.; Gong, H.; Hu, T.; Li, S.; Sun, Y.; Zhang, T.; Guo, Y. A Lightweight Model of VGG-16 for Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6916–6922. [Google Scholar]
Zhang, C.; Meng, D.; He, J. VGG-16 Convolutional Neural Network-Oriented Detection of Filling Flow Status of Viscous Food. J. Adv. Comput. Intell. Intell. Inform. 2020, 24, 568–575. [Google Scholar] [CrossRef]
Huiskes, M.J.; Lew, M.S. The MIR Flickr retrieval evaluation. In Proceedings of the International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008. [Google Scholar]
Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the International Conference on Image and Video Retrieval, Santorini Island, Greece, 8–10 July 2009. [Google Scholar]
Olga, R.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar]
Hu, D.; Nie, F.; Li, X. Deep Binary Reconstruction for Cross-modal Hashing. IEEE Trans. Multimed. 2019, 21, 973–985. [Google Scholar] [CrossRef] [Green Version]

Figure 1. The framework of the unsupervised deep relative neighbor relationship that preserves cross-modal hashing.

Figure 2. The cross-modal retrieval task includes image retrieving text and text retrieving image.

Figure 3. The pair-wise similarity preserving constraint. The anchor is located at the arrow tail. Correspondingly, the samples located at the arrow heads should have the same Hamming distance to the anchor.

Figure 4. The top-N precision curves of an image retrieving text. The length of binary feature vectors is 128.

Figure 5. The top-N precision curves of text retrieving an image. The length of binary feature vectors is 128.

Figure 6. The mAP value of the proposed method’s cross-modal retrieval performances with different parameter values on the NUS-WIDE dataset. (a) The cross-modal retrieval performances with different α, β, μ values and the best performance is obtained when α = 0.4, β = 0.2, μ = 0.4. (b) The cross-modal retrieval performances with different η₁ and η₂ values and the best performance is obtained when η₁ = 0.1, η₂ = 0.1. (c) The cross-modal retrieval performances with different γ₁ values and the best performance is obtained when γ₁ = 1.5. (d) The cross-modal retrieval performances with different γ₂ values and the best performance is obtained when γ₂ = 1.5. (e) The cross-modal retrieval performances with different ω₁ values and the best performance is obtained when ω₁ = 0.1. (f) The cross-modal retrieval performances with different ω₂ values and the best performance is obtained when ω₂ = 0.6. (g) The cross-modal retrieval performances with different ω₃ values and the best performance is obtained when ω₃ = 0.8.

Table 1. The advantages of our proposed method compared with other similar works.

	Neighbor Structure Preserving	Cross-Modal Pair-Wise Similarity Preserving	Cross-Modal Relative Similarity Preserving
UGACH [41]	×	×	√
DCMH [37]	×	×	×
SSAH [42]	×	×	×
DSCMR [1]	×	√	×
DSSAH [35]	√	×	×
CDPAE [43]	×	√	×
UKD [9]	×	√	×
Our	√	√	√

Table 2. The mAP@50 of image retrieving text.

Task	Method	MIRFlickr				NUS-WIDE
Task	Method	16 Bits	32 Bits	64 Bits	128 Bits	16 Bits	32 Bits	64 Bits	128 Bits
$I \to T$	IMH	0.612	0.601	0.592	0.579	0.470	0.473	0.476	0.459
	LCMH	0.559	0.569	0.585	0.593	0.354	0.361	0.389	0.383
	CMFH	0.621	0.624	0.625	0.627	0.455	0.459	0.465	0.467
	LSSH	0.584	0.599	0.602	0.614	0.481	0.489	0.507	0.507
	RFDH	0.632	0.636	0.641	0.652	0.488	0.492	0.494	0.508
	DBRC	0.617	0.619	0.620	0.621	0.424	0.459	0.447	0.447
	UDCMH	0.689	0.698	0.714	0.717	0.511	0.519	0.524	0.558
	DJSRH	0.810	0.843	0.862	0.876	0.724	0.773	0.798	0.817
	JDSH	0.832	0.853	0.882	0.892	0.736	0.793	0.832	0.835
	PSTH	0.863	0.872	0.880	0.895	0.774	0.796	0.842	0.821
	DRNPH	0.876	0.902	0.914	0.933	0.790	0.811	0.826	0.837

Table 3. The mAP@50 of text retrieving image.

Task	Method	MIRFlickr				NUS-WIDE
Task	Method	16 Bits	32 Bits	64 Bits	128 Bits	16 Bits	32 Bits	64 Bits	128 Bits
$T \to I$	IMH	0.603	0.595	0.589	0.580	0.478	0.483	0.472	0.462
	LCMH	0.561	0.569	0.582	0.582	0.376	0.387	0.408	0.419
	CMFH	0.642	0.662	0.676	0.685	0.529	0.577	0.614	0.645
	LSSH	0.618	0.626	0.626	0.628	0.455	0.459	0.468	0.473
	RFDH	0.681	0.693	0.698	0.702	0.612	0.641	0.658	0.680
	DBRC	0.618	0.626	0.626	0.628	0.455	0.459	0.468	0.473
	UDCMH	0.692	0.704	0.718	0.733	0.637	0.653	0.695	0.716
	DJSRH	0.786	0.822	0.835	0.847	0.712	0.744	0.771	0.789
	JDSH	0.825	0.864	0.878	0.880	0.721	0.795	0.794	0.804
	PSTH	0.845	0.844	0.845	0.861	0.749	0.769	0.803	0.791
	DRNPH	0.860	0.872	0.885	0.897	0.780	0.795	0.804	0.811

Table 4. The mAP@50 of the image retrieving text task and the number of binary bits is 128.

Task	Method	MIRFlickr	NUS-WIDE
$I \to T$	DRNPH-1	0.899	0.807
	DRNPH-2	0.904	0.811
	DRNPH-3	0.910	0.815
	DRNPH-4	0.925	0.831
	DRNPH	0.933	0.837

Table 5. The mAP@50 of the text retrieving image task and the number of binary bits is 128.

Task	Method	MIRFlickr	NUS-WIDE
$T \to I$	DRNPH-1	0.868	0.792
	DRNPH-2	0.871	0.793
	DRNPH-3	0.875	0.796
	DRNPH-4	0.891	0.807
	DRNPH	0.897	0.811

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, X.; Wang, Z.; Wu, N.; Li, G.; Feng, C.; Liu, P. Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing. Mathematics 2022, 10, 2644. https://doi.org/10.3390/math10152644

AMA Style

Yang X, Wang Z, Wu N, Li G, Feng C, Liu P. Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing. Mathematics. 2022; 10(15):2644. https://doi.org/10.3390/math10152644

Chicago/Turabian Style

Yang, Xiaohan, Zhen Wang, Nannan Wu, Guokun Li, Chuang Feng, and Pingping Liu. 2022. "Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing" Mathematics 10, no. 15: 2644. https://doi.org/10.3390/math10152644

APA Style

Yang, X., Wang, Z., Wu, N., Li, G., Feng, C., & Liu, P. (2022). Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing. Mathematics, 10(15), 2644. https://doi.org/10.3390/math10152644

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing

Abstract

1. Introduction

2. The Proposed Method

2.1. Cross-Modal Retrieval Algorithm

2.2. Unsupervised Deep Relative Neighbor Relationship Preserving Cross-Modal Hashing

2.2.1. Deep Learning Binary Feature Vectors

2.2.2. Neighbor Structure Preserving

2.2.3. Cross-Modal Pair-Wise Similarity Preserving

2.2.4. Cross-Modal Triple Relative Similarity Preserving

2.2.5. Optimization

3. Experiments

3.1. Datasets

3.2. Experimental Setting

3.3. Evaluation Metrics

3.4. Experimental Results and Analysis

3.5. Ablation Experiments

3.6. The Parameter Setting Experiments

4. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI