3.1. Similarity Diffusion for Single-Layer Features
Given a query image $q$ and its retrieval list $N(q, n) = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the image that is the $i$-th most similar to the query in the image set and $n$ is the number of images in the retrieval list that need to be re-ranked, let $X = \{q, x_1, x_2, \ldots, x_n\}$ denote the set containing the query and all retrieved images, and let $f(x)$ denote the feature vector of image $x$. In the diffusion process, a weighted graph $G = (X, E)$ is first built: $X$ is the vertex set, each vertex corresponds to an image, $E$ is the edge set, and the weight of each edge is proportional to the similarity between the data points it connects. It is worth noting that the similarity between two images is represented by the cosine similarity between their feature vectors, and each element of the affinity matrix $W$ associated with $E$ is defined as:

$$ w_{ij} = \cos\bigl(f(x_i), f(x_j)\bigr), \quad x_i, x_j \in X, \qquad (2) $$

where we have $0 \le w_{ij} \le 1$, and $w_{ij} = 0$ when $i = j$. This results in a symmetric weighted graph $G$, whose affinity matrix satisfies $W = W^{\top}$.
To realize the diffusion of similarity information, as done in previous works [21,22], we first obtain the row-stochastic matrix $P$ from $W$:

$$ P = D^{-1} W, \qquad (3) $$

where $D$ is the diagonal degree matrix with $d_{ii} = \sum_{j} w_{ij}$. $P$ no longer satisfies the symmetry, while satisfying the property in (4):

$$ \sum_{j} p_{ij} = 1, \quad \forall i. \qquad (4) $$

This operation defines a Markov random walk on the graph $G$: $P$ is the transfer matrix, and $p_{ij}$ represents the probability of transferring from vertex $x_i$ to vertex $x_j$. The diffusion process spreads the similarity information along the weighted edges, and this process can be seen as a random walk on the graph.
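As an illustration of this construction, the following sketch builds the cosine affinity matrix and its row-normalized transfer matrix with NumPy. The names (build_transfer_matrix, feats) are illustrative, and treating the query feature as row 0 of feats, zeroing the diagonal, and clipping negative similarities are assumptions rather than details taken from the paper.

```python
import numpy as np

def build_transfer_matrix(feats):
    """Cosine affinity matrix W (Equation (2)) and row-stochastic matrix P = D^{-1} W
    (Equations (3) and (4)).

    feats: (n+1, d) array; row 0 is assumed to be the query feature and rows 1..n
    the features of the retrieval list.
    """
    # Cosine similarity between all pairs of images.
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = normed @ normed.T
    np.fill_diagonal(W, 0.0)      # assumption: no self-loops (w_ij = 0 when i = j)
    W = np.clip(W, 0.0, None)     # assumption: keep edge weights non-negative

    # Row normalization: D is the diagonal degree matrix, so each row of P sums to 1.
    degrees = W.sum(axis=1, keepdims=True)
    P = W / np.maximum(degrees, 1e-12)
    return W, P
```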
From a data analysis point of view, the reason for studying the diffusion process is that the transfer matrix $P$ contains geometric information about the dataset [23], and the transfer probabilities defined by $P$ directly reflect the local geometry defined by the nearest neighbors of each vertex in the graph. The entry $p_{ij}$ denotes the probability of transferring from vertex $x_i$ to vertex $x_j$ in one time step, which is proportional to the weight of the edge $w_{ij}$. For $t \ge 1$, the probability of transferring from $x_i$ to $x_j$ in $t$ time steps is given by the $(i, j)$ entry of $P^{t}$. In the diffusion process, the similarity information is chained forward over time, all the local geometries are gradually captured, and ideally the diffusion process can reveal the underlying geometric structure of the data manifold.
The implementation of diffusion is not unique; in this paper, we use a simple but effective method [12]:

$$ A_{t+1} = \alpha P A_t + I, \qquad A_1 = I, \qquad (5) $$

where $t$ is a positive integer and $\alpha \in (0, 1)$ is a decay coefficient that makes $A_t$ converge as $t \to \infty$. Intuitively, $\alpha$ controls the diffusivity of $P$ at a fixed $t$: the larger the value of $\alpha$, the greater the influence of each vertex on the others. Typically, the literature [24] sets $\alpha$ to a value between 0 and 1.
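A minimal sketch of the iterative update in (5), assuming the recursion $A_{t+1} = \alpha P A_t + I$ with $A_1 = I$ as reconstructed above; the function name and default values are illustrative.

```python
import numpy as np

def diffuse(P, alpha=0.5, steps=50):
    """Iterate the diffusion update of Equation (5): A_{t+1} = alpha * P @ A_t + I, with A_1 = I."""
    m = P.shape[0]
    A_t = np.eye(m)                        # A_1 = I
    for _ in range(steps):
        A_t = alpha * P @ A_t + np.eye(m)  # one diffusion step along the weighted edges
    return A_t
```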
Theorem 1. It follows from the literature [22] that $A_t$ converges to a fixed nontrivial solution for arbitrary $\alpha \in (0, 1)$:

$$ \lim_{t \to \infty} A_t = (I - \alpha P)^{-1}, \qquad (6) $$

where $I$ is a unit matrix of size $(n+1) \times (n+1)$.

Proof. The proof of the convergence of $A_t$ as $t \to \infty$ proceeds as follows:

1. For (5), we can write $A_t$ in the following format:

$$ A_t = I + \alpha P + (\alpha P)^{2} + \cdots + (\alpha P)^{t-1} = \sum_{i=0}^{t-1} (\alpha P)^{i}, \qquad (7) $$

which is a geometric progression with the general form $a + ar + ar^{2} + \cdots$. If $a$ is replaced by $I$ and $r$ is replaced by $\alpha P$, the geometric progression in the text is obtained.

2. For an infinite geometric series $a + ar + ar^{2} + \cdots$, the sum of its first $n$ terms is:

when $r \neq 1$,

$$ S_n = \frac{a(1 - r^{n})}{1 - r}; \qquad (8) $$

when $r = 1$,

$$ S_n = n a. \qquad (9) $$

3. For the geometric series in the text, since $P$ is a row-stochastic matrix, the absolute values of all its eigenvalues are less than or equal to 1, so the spectral radius satisfies $\rho(P) \le 1$; together with $0 < \alpha < 1$, we obtain $\rho(\alpha P) < 1$, which satisfies the condition $r \neq 1$ in step 2.

4. For the geometric series in the text, the sum of its first $t$ terms is: for finite $t$,

$$ A_t = \bigl(I - (\alpha P)^{t}\bigr)(I - \alpha P)^{-1}; \qquad (10) $$

when $t \to \infty$, since $\rho(\alpha P) < 1$,

$$ \lim_{t \to \infty} (\alpha P)^{t} = 0. \qquad (11) $$

From steps 1 and 4 we can conclude:

$$ \lim_{t \to \infty} A_t = (I - \alpha P)^{-1}. \qquad (12) $$

The above Equation (6) is proved. □
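The limit in (6) can be checked numerically. The following sketch compares the iterate of (5) with the closed-form inverse for a random row-stochastic matrix; the matrix size, $\alpha$, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 6, 0.5

# A random row-stochastic transfer matrix P.
P = rng.random((m, m))
P /= P.sum(axis=1, keepdims=True)

# Iterate Equation (5) for many steps ...
A_t = np.eye(m)
for _ in range(200):
    A_t = alpha * P @ A_t + np.eye(m)

# ... and compare with the closed-form limit of Equation (6).
A_star = np.linalg.inv(np.eye(m) - alpha * P)
print(np.allclose(A_t, A_star))   # True: the iteration converges to (I - alpha*P)^{-1}
```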
Ideally, the diffusion process can reveal the underlying geometric structure of the data manifold. However, the diffusion process is sensitive to noise: if noise or outliers distort the actual topological structure of the data manifold, the diffusion process may fail to capture the correct structure. Because noise and outliers affect the distribution of the data points, they introduce errors into the transfer matrix used in the diffusion process. In this case, introducing local constraints can reduce the impact of noisy data points on the diffusion process.
3.2. Local Constraint Diffusion Based on Contextual Similarity
Since the diffusion process is affected by noise and outliers, in order to minimize the effect of these data points, we introduce a locally constrained diffusion process based on contextual similarity.
In the classical diffusion process, all paths between vertices $x_i$ and $x_j$ are considered when calculating the probability of walking from vertex $x_i$ to vertex $x_j$. If there are several noisy points in the retrieval list, such as negative sample images, then the paths passing through these noisy points distort the calculation of the transfer probability.
In order to solve the above problem, we introduce a method [25] that limits the random walk on the weighted graph to the $k$ contextual nearest neighbors of the current data point, which mitigates the effect of noise on the transfer probability calculation [23]. First, a $k$-contextual-nearest-neighbor graph $G_k$ is constructed from the original weighted graph $G$. The vertices of $G_k$ are the same as those of $G$, while the weights of the edges are different; the edge weights in the $k$-contextual-nearest-neighbor graph $G_k$ are defined as in (13):

$$ w_{ij}^{(k)} = \begin{cases} w_{ij}, & x_j \in \mathcal{N}_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \qquad (13) $$

where $\mathcal{N}_k(x_i)$ denotes the set of $k$ contextual nearest neighbors of $x_i$. When $x_j$ belongs to the set of $k$ contextual nearest neighbors of $x_i$, the weight is set to the original similarity $w_{ij}$; the weight between points that are not contextual nearest neighbors is set to 0. At this point, the probability of transferring from vertex $x_i$ to vertex $x_j$ is:

$$ p_{ij}^{(k)} = \frac{w_{ij}^{(k)}}{\sum_{l} w_{il}^{(k)}}. \qquad (14) $$
In this case, the convergent solution of (6) still holds, with $P$ replaced by the constrained transfer matrix $P_k = (p_{ij}^{(k)})$. Replacing the original transfer probability matrix $P$ with $P_k$ reduces the effect of noise: the complete information about the similarity of each data point to all other data points is retained in $P$, whereas $P_k$ only retains the similarity of each data point to its contextual nearest neighbors. Essentially, the underlying assumption is that local similarity (high values) is more reliable than distant similarity, an assumption widely adopted by other manifold learning algorithms [26,27].
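A possible implementation of the local constraint in (13) and (14) is sketched below; constrain_transfer_matrix and the neighbors argument (a per-vertex collection of contextual-nearest-neighbor indices, obtained as described next) are illustrative names, not the paper's code.

```python
import numpy as np

def constrain_transfer_matrix(W, neighbors):
    """Locally constrained transfer matrix P_k of Equations (13) and (14).

    W:         (m, m) affinity matrix of the original weighted graph.
    neighbors: neighbors[i] is the index set of the k contextual nearest
               neighbors of vertex i (its construction is described below).
    """
    W_k = np.zeros_like(W)
    for i in range(W.shape[0]):
        idx = list(neighbors[i])
        W_k[i, idx] = W[i, idx]          # keep only edges towards contextual neighbors (13)
    degrees = W_k.sum(axis=1, keepdims=True)
    return W_k / np.maximum(degrees, 1e-12)   # row-normalize into transfer probabilities (14)
```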
To determine the set $\mathcal{N}_k(x_i)$ and to define the contextual proximity of two points, contextual similarity is introduced here. Let $N(x_i, n)$ denote the list of retrieval results for image $x_i$ and $N(x_j, n)$ denote the list of retrieval results for image $x_j$; then the contextual similarity between $x_i$ and $x_j$ can be measured by the Jaccard similarity coefficient [9,17]:

$$ s_J(x_i, x_j) = \frac{\bigl| N(x_i, n) \cap N(x_j, n) \bigr|}{\bigl| N(x_i, n) \cup N(x_j, n) \bigr|}, \qquad (15) $$

where $|\cdot|$ denotes the cardinality of the set obtained after the intersection or union operation.
Therefore, given an image $x$, we first perform a secondary query for each image in its retrieval list $N(x, n)$, obtaining $n$ additional retrieval lists. Then, we compute the intersection and union of each pair of retrieval lists and calculate, through (15), the contextual similarity between $x$ and each image in its retrieval list $N(x, n)$. Finally, we select the top $k$ images ranked by contextual similarity, denoted as the $k$ contextual neighbors of image $x$, $\mathcal{N}_k(x)$; that is, the image $x$ and the images in the set $\mathcal{N}_k(x)$ are contextual neighbors. The specific process of obtaining the contextual neighbors is shown in Figure 3.
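The procedure of Figure 3 can be sketched as follows; retrieval_list(image, n) is a hypothetical helper that returns the top-n retrieval results of an image, and the remaining names are likewise illustrative.

```python
def jaccard_similarity(list_a, list_b):
    """Contextual similarity of Equation (15): Jaccard coefficient of two retrieval lists."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

def contextual_neighbors(x, n, k, retrieval_list):
    """Select the k contextual nearest neighbors of image x.

    retrieval_list(image, n) is a hypothetical helper returning the top-n
    retrieval results of an image (the secondary queries of Figure 3).
    """
    base = retrieval_list(x, n)
    # One additional retrieval list per image in the original list.
    scored = [(y, jaccard_similarity(base, retrieval_list(y, n))) for y in base]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [y for y, _ in scored[:k]]
```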
Equation (15) uses additional contextual information but has some limitations. First, computing the intersection and union of two sets of nearest neighbors is very time-consuming, especially when the Jaccard distance needs to be computed for all image pairs. Second, when computing the set of nearest neighbors, all nearest neighbors receive the same weight: each valid nearest neighbor is counted as 1, whereas in reality an image closer to the query image $x$ should be regarded as more similar to $x$.
For the first problem, the two sets of nearest neighbors can be encoded as simpler but equivalent vectors, which are more convenient to compute and greatly reduce the computational complexity. This is solved here by sparse context encoding [11], which encodes the nearest neighbor sets into nearest neighbor vectors and thus converts set operations into vector operations. Specifically, sparse context encoding converts the set of nearest neighbors $N(x_i, n)$ into an $N$-dimensional nearest neighbor vector $\mathbf{v}_{x_i}$ via an indicator function:

$$ \mathbf{v}_{x_i} = \bigl[ \mathbb{1}_{N(x_i, n)}(g_1), \mathbb{1}_{N(x_i, n)}(g_2), \ldots, \mathbb{1}_{N(x_i, n)}(g_N) \bigr], \qquad (16) $$

where $g_j$ denotes the $j$-th image in the image library and $N$, the total number of images in the image library, is much larger than $n$. The indicator function $\mathbb{1}_{N(x_i, n)}(\cdot)$ in the formula is defined as follows:

$$ \mathbb{1}_{N(x_i, n)}(g_j) = \begin{cases} 1, & g_j \in N(x_i, n), \\ 0, & \text{otherwise}. \end{cases} \qquad (17) $$

In (17), each term in the binary vector $\mathbf{v}_{x_i}$ indicates whether its corresponding image belongs to the $n$ nearest neighbors of $x_i$: if it is one of the $n$ nearest neighbors of $x_i$, the corresponding term is 1; otherwise it is 0.
For the second problem, (15) assumes that all nearest neighbors are equal, i.e., each element in the set of nearest neighbors has the same weight. To solve this problem, we redistribute the weights based on the original distance between the query image and the images in the retrieval list, and obtain the new indicator function as follows:

$$ \mathbb{1}_{N(x_i, n)}(g_j) = \begin{cases} \cos\bigl(f(x_i), f(g_j)\bigr), & g_j \in N(x_i, n), \\ 0, & \text{otherwise}, \end{cases} \qquad (18) $$

where $\cos(\cdot, \cdot)$ represents the cosine similarity, as in (1). In this case, the weight of near neighbors is larger and the weight of distant neighbors is smaller.

Based on this definition of the indicator function, the computation of the intersection and union of $N(x_i, n)$ and $N(x_j, n)$ can be rewritten as a vector computation:

$$ \mathbf{v}_{x_i \cap x_j} = \min(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}), \qquad \mathbf{v}_{x_i \cup x_j} = \max(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}), \qquad (19) $$

where the $\min$ operation computes the minimum of the corresponding elements of the two input vectors and the $\max$ operation computes the maximum of the corresponding elements. Next, the cardinalities of the intersection and union can be obtained by computing the $\ell_1$ norm:

$$ \bigl| N(x_i, n) \cap N(x_j, n) \bigr| = \bigl\| \min(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}) \bigr\|_1, \qquad \bigl| N(x_i, n) \cup N(x_j, n) \bigr| = \bigl\| \max(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}) \bigr\|_1. \qquad (20) $$

Therefore, we can rewrite the Jaccard similarity in (15) as:

$$ s_J(x_i, x_j) = \frac{\bigl\| \min(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}) \bigr\|_1}{\bigl\| \max(\mathbf{v}_{x_i}, \mathbf{v}_{x_j}) \bigr\|_1}. \qquad (21) $$
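A compact sketch of sparse context encoding and the vectorized Jaccard similarity of (16)-(21); the helper names are illustrative, and the non-negativity of the weights (so that the sum of a vector equals its $\ell_1$ norm) is an assumption.

```python
import numpy as np

def encode_neighbors(neighbor_ids, weights, library_size):
    """Sparse context encoding (Equations (16)-(18)): map a retrieval list to an N-dim vector.

    neighbor_ids: indices (in the image library) of the n nearest neighbors of an image.
    weights:      per-neighbor weights; all-ones gives the binary indicator of (17),
                  cosine similarities give the weighted indicator of (18).
                  Weights are assumed to be non-negative.
    """
    v = np.zeros(library_size)
    v[np.asarray(neighbor_ids)] = weights
    return v

def vector_jaccard(v_a, v_b):
    """Vectorized Jaccard similarity of Equations (19)-(21)."""
    intersection = np.minimum(v_a, v_b).sum()   # l1 norm of the element-wise minimum
    union = np.maximum(v_a, v_b).sum()          # l1 norm of the element-wise maximum
    return intersection / union if union > 0 else 0.0
```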
Thus, the problems of slow computation and equal weighting of data points in the original Jaccard similarity are solved. First, the contextual neighbor set $\mathcal{N}_k(x_i)$ is obtained based on the new Jaccard similarity formula (21). Then, the transfer probability matrix $P_k$, i.e., the locally constrained diffusion matrix, is obtained through (13) and (14), and the locally constrained diffusion is performed. The converged solution $A^{*}$ of the diffusion is finally obtained:

$$ A^{*} = \lim_{t \to \infty} A_t = (I - \alpha P_k)^{-1}. \qquad (22) $$

The converged solution contains the similarity information between the query image and the images in the retrieval list.
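As a rough illustration of how (22) might be computed, the following sketch takes a locally constrained transfer matrix and returns the closed-form converged solution; the function name and the use of a direct matrix inverse are assumptions, not the paper's implementation.

```python
import numpy as np

def locally_constrained_diffusion(P_k, alpha=0.5):
    """Closed-form converged solution of Equation (22): A* = (I - alpha * P_k)^{-1}.

    P_k is the locally constrained transfer matrix of Equations (13) and (14);
    alpha is the decay coefficient of Equation (5).
    """
    m = P_k.shape[0]
    return np.linalg.inv(np.eye(m) - alpha * P_k)
```

Under this reading, the row of $A^{*}$ corresponding to the query could be interpreted as the diffused similarity scores between the query and the images in the retrieval list.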
3.3. Convergent Diffusion of Multilayer Features
Different from the diffusion process in Section 3.1, which performs similarity diffusion on only a single weighted graph, fusion diffusion solves the diffusion problem for multiple weighted graphs simultaneously. From the previous sections of this paper, we can obtain the weighted graphs and transfer probability matrices corresponding to the high-level features and the low-level features, denoted as $G^{(1)}$, $P^{(1)}$ and $G^{(2)}$, $P^{(2)}$, respectively. The goal of fusion diffusion is to learn a new similarity metric $M$ that exploits the complementarity between multiple visual features to obtain an enhanced similarity measure.

One way to compute $M$ is to perform a weighted linear combination of multiple similarity measures:

$$ M = \sum_{i} \beta_i P^{(i)}, \qquad \sum_{i} \beta_i = 1, \qquad (23) $$

where $P^{(i)}$ denotes the transfer matrix of the $i$-th weighted graph and $\beta_i$ is its combination weight.
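A weighted linear combination as in (23) might be sketched as follows; the function name and the normalization of the mixing weights are assumptions for illustration.

```python
import numpy as np

def linear_fusion(transfer_matrices, betas):
    """Weighted linear combination of transfer matrices, one simple way to compute M (Equation (23))."""
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()          # normalize the mixing weights so they sum to 1
    return sum(b * P for b, P in zip(betas, transfer_matrices))
```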
This weighted-combination approach is simple and easy to implement, but it ignores the correlation between different similarity measures. Two other fusion strategies are Tensor Product Fusion [28,29] and the Cross-Diffusion Process [30,31]. The general procedure of Tensor Product Fusion is that, given two different similarity measures, a tensor product graph (TPG) is first constructed, and the two similarities are then jointly diffused on the TPG using a diffusion process. Cross-Diffusion algorithms are similar in spirit to co-training: the two state matrices exchange similarity information with each other during the iteration, producing two parallel diffusion processes. These methods can effectively exploit the correlation between similarities.
Compared with Tensor Product Fusion, Cross-Diffusion has a smaller computational load. Here, we adopt the alternating diffusion algorithm proposed by Lederman et al. [31] to perform Cross-Diffusion and merge the similarity information on the two graphs. In that work, the algorithm was used to extract common sources of variation from the measurements of multiple sensors, and it defined the alternating diffusion operator $K$ and the diffusion distance $d$. The alternating diffusion operator can capture the structure of the variables common to multiple sensors while disregarding variables specific to a single sensor. The diffusion distance can capture the structure of a graph by measuring the "connectivity" between two samples across the entire sample set, as opposed to directly comparing individual samples with, for example, the cosine or Euclidean distance.
To capture the common similarity information in the two transfer probability graphs, we construct the alternating diffusion operator $K$ from the obtained transfer matrices. Then, we compute the diffusion distance between samples, i.e., the distance between images after fusing the similarities. Sorting by the diffusion distance then yields the final retrieval list. The specific process is as follows.
Construct the alternating diffusion operator $K$:

$$ K = P^{(2)} P^{(1)}. \qquad (24) $$

Calculate the diffusion distance between two samples:

$$ d(x_i, x_j) = \bigl\| K(x_i, \cdot) - K(x_j, \cdot) \bigr\|_2. \qquad (25) $$

Intuitively, alternating diffusion operates on the same set of vertices $X$, but the diffusion process is divided into two steps: the first step uses the transition probability matrix $P^{(1)}$ and the second step uses the transition probability matrix $P^{(2)}$. The combination of the two consecutive steps is thus a Markov chain on a new weighted graph $G^{(M)}$, where the transition probabilities are determined by the matrix $K = P^{(2)} P^{(1)}$. The diffusion process can therefore be performed in two steps.
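Under the reconstruction in (24) and (25), the alternating diffusion operator and the pairwise diffusion distances could be computed as in the following sketch; taking the Euclidean distance between rows of $K$ is an assumption consistent with the description above, and all names are illustrative.

```python
import numpy as np

def alternating_diffusion_distances(P1, P2):
    """Alternating diffusion operator (Equation (24)) and pairwise diffusion distances (Equation (25)).

    P1, P2: (m, m) row-stochastic transfer matrices of the two feature graphs.
    """
    K = P2 @ P1                               # two consecutive random-walk steps on the same vertex set
    diff = K[:, None, :] - K[None, :, :]      # row i minus row j, for all pairs (i, j)
    D = np.linalg.norm(diff, axis=-1)         # d(x_i, x_j) = ||K(x_i, .) - K(x_j, .)||_2
    return K, D
```

Sorting the row of D that corresponds to the query in ascending order would then give the fused retrieval list.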