This section presents the proposed LHH. Section 3.1 gives the notation and problem statement. The details of the proposed LHH are presented in Section 3.2. Section 3.3 presents the optimization of the proposed LHH. Section 3.4 introduces the learning of the hashing function. Finally, Section 3.5 presents the convergence and computational complexity analysis.
3.2. Low-Rank Hypergraph Hashing
To exploit the supervised information, we cast binary code learning in the context of classification: the binary codes are encouraged to be optimal for a jointly learned classifier, so that good binary codes are also well suited for classification.
Given a binary code $\mathbf{b}$, we adopt the following multi-class classification formulation, where $\mathbf{w}_k$ is the projection for class $k$ and $\mathbf{y} = G(\mathbf{b})$ is the predicted label vector, whose maximum entry indicates the assigned class of $\mathbf{x}$.
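For concreteness, such a linear multi-class classifier on binary codes can be sketched as follows (the notation, in particular the number of classes $K$ and code length $l$, is assumed here and may differ from the original equation):

```latex
% Sketch of a linear classifier acting on an l-bit binary code b,
% with one projection vector w_k per class (notation assumed).
\mathbf{y} = G(\mathbf{b}) = W^{\top}\mathbf{b}
           = \big[\mathbf{w}_1^{\top}\mathbf{b}, \ldots, \mathbf{w}_K^{\top}\mathbf{b}\big]^{\top},
\qquad W = [\mathbf{w}_1, \ldots, \mathbf{w}_K] \in \mathbb{R}^{l \times K}.
```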
We choose to optimize the problem in Equation (3), where $L(\cdot)$ is the loss function, $R(W)$ is a regularizer, and $\lambda$ is a regularization parameter. $Y$ is the ground-truth label matrix, where $y_{ki} = 1$ if $\mathbf{x}_i$ belongs to class $k$ and $y_{ki} = 0$ otherwise.
Equation (3) is flexible, and any loss function can be used for $L(\cdot)$. For simplicity, we choose the $\ell_2$ loss, which minimizes the difference between the label $Y$ and the prediction $G(\mathbf{b})$. The problem in Equation (3) can then be transformed into Equation (4).
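Assuming the binary codes are stacked column-wise into $B \in \{-1,1\}^{l \times n}$ and $Y$ is the corresponding label matrix, the resulting $\ell_2$-loss problem takes a form similar to the following sketch (the exact regularizer and symbols in Equation (4) are assumptions):

```latex
% Sketch of an l2-loss formulation over binary codes (assumed form).
\min_{B,\,W}\; \|Y - W^{\top} B\|_F^2 + \lambda\, R(W)
\quad \text{s.t.}\quad B \in \{-1, 1\}^{l \times n}.
```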
To encourage the coefficients of data in the same space to be highly correlated, we apply a low-rank constraint to capture the global structure of the whole dataset. In addition, the low-rank structure alleviates the impact of noise and makes the regression more accurate [37,38]. To exploit the low-rank structure of $W$, we impose a constraint on the rank of $W$.
We decompose $W$ into the product of two low-rank matrices, i.e., $W = AC$, where $r$ is the rank of $W$. Then, Equation (4) can be further transformed into a factorized problem in which the orthogonality constraint $A^{\top}A = I$ is introduced for identifiability. Besides, we additionally enforce sparsity via the $\ell_{2,1}$-norm for feature selection [39]. Thus, the above problem is rewritten as Equation (7).
In Equation (7), we consider both low-rankness and sparsity to learn the regression coefficient matrix: low-rankness deals with noise, and the $\ell_{2,1}$-norm selects features by driving some rows of $W$ to zero.
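To make the role of the $\ell_{2,1}$-norm concrete, the following NumPy sketch (with assumed variable names) computes $\|W\|_{2,1}$ and the diagonal reweighting matrix commonly used to handle this norm in closed-form updates; whether LHH uses exactly this reweighting is an assumption.

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the Euclidean norms of the rows of W.
    Rows driven to zero correspond to features that are not selected."""
    return np.sum(np.linalg.norm(W, axis=1))

def l21_reweight_diag(W, eps=1e-8):
    """Diagonal matrix with entries 1 / (2 * ||w_i||_2), the standard
    iteratively-reweighted surrogate for the l2,1-norm (assumed here;
    the paper's exact diagonal matrix may differ)."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

# Tiny usage example with a row-sparse matrix.
W = np.array([[0.0, 0.0, 0.0],
              [1.0, 2.0, 2.0],
              [0.5, 0.0, 0.0]])
print(l21_norm(W))           # 3.0 + 0.5 = 3.5
print(l21_reweight_diag(W))  # large weight on the (near-)zero row
```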
So far, we have not considered the similarity structure among the data. If two samples are similar, their corresponding binary codes should also be close. To preserve the original local similarity structure, we aim to minimize a similarity-weighted distance between binary codes, where $S$ is the similarity matrix that records the pairwise similarities among the data, and $S_{ij}$ represents the relationship between the $i$-th and the $j$-th sample. Normally, such a graph is constructed with a Gaussian kernel, $S_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / \sigma^2)$, where $\sigma$ is the kernel width and $\|\mathbf{x}_i - \mathbf{x}_j\|$ denotes the distance between two samples.
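As an illustration of this kind of graph construction, the sketch below builds a Gaussian-kernel similarity matrix restricted to k-nearest neighbors; the kNN sparsification and the default kernel width are assumptions rather than details taken from the paper.

```python
import numpy as np

def gaussian_knn_similarity(X, n_neighbors=5, sigma=None):
    """Similarity matrix S with S_ij = exp(-||x_i - x_j||^2 / sigma^2),
    kept only for k-nearest neighbors (symmetrized). X has shape (n, d)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if sigma is None:                      # heuristic: median pairwise distance
        sigma = np.sqrt(np.median(dist2[dist2 > 0])) + 1e-12
    S = np.exp(-dist2 / (sigma**2))
    # Keep only the k nearest neighbors of each sample (excluding itself).
    keep = np.zeros_like(S, dtype=bool)
    order = np.argsort(dist2, axis=1)
    for i in range(n):
        keep[i, order[i, 1:n_neighbors + 1]] = True
    S = np.where(keep | keep.T, S, 0.0)
    np.fill_diagonal(S, 0.0)
    return S
```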
Here, we instead use a hypergraph to measure the similarity among data. Figure 2 shows the distinction between a normal graph and a hypergraph: a normal graph edge only connects two samples, while a hyperedge can connect more than two samples. Therefore, a hypergraph can reveal more complex relationships among the data [23]. We formulate the incidence matrix $H$ between the vertices and the hyperedges of the hypergraph as $h(v, e) = 1$ if vertex $v$ lies on hyperedge $e$, and $h(v, e) = 0$ otherwise.
The degree $d(v)$ of a vertex $v$ is defined as the sum of the weights of the hyperedges incident to $v$, and the degree $\delta(e)$ of a hyperedge $e$ is defined as the number of vertices it contains.
With the above definitions, a normalized distance between two vertices lying on the same hyperedge can be measured on the hypergraph. To preserve the similarity of hash codes, we aim to map data on the same hyperedge into similar hash codes. Thus, we seek the hash codes by minimizing the average Hamming distance between the hash codes of data lying on the same hyperedge, which gives Equation (12).
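A common way to build such a hypergraph, assumed in the sketch below, is to create one hyperedge per sample containing that sample and its k nearest neighbors; the incidence matrix and the vertex and hyperedge degrees then follow directly from the definitions above (hyperedge weights are taken to be uniform).

```python
import numpy as np

def knn_hypergraph_incidence(X, k=5):
    """Incidence matrix H (n vertices x n hyperedges): hyperedge j contains
    sample j and its k nearest neighbors, so H[v, e] = 1 iff vertex v lies
    on hyperedge e (kNN hyperedges are an assumed construction)."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    H = np.zeros((n, n))
    for j in range(n):
        neighbors = np.argsort(dist2[j])[:k + 1]  # includes sample j itself
        H[neighbors, j] = 1.0
    return H

def hypergraph_degrees(H, w=None):
    """Vertex degrees d(v) = sum_e w(e) h(v, e) and
    hyperedge degrees delta(e) = sum_v h(v, e)."""
    if w is None:
        w = np.ones(H.shape[1])              # uniform hyperedge weights (assumed)
    d_v = H @ w
    delta_e = H.sum(axis=0)
    return d_v, delta_e
```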
By introducing the hypergraph Laplacian, we further rewrite Equation (12) in a compact matrix form as Equation (13), where the hypergraph Laplacian matrix is built from the identity matrix $I$, the incidence matrix $H$, and the diagonal matrices $D_v$ and $D_e$, whose diagonal elements are the degrees of the hypergraph vertices and hyperedges, respectively.
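Assuming the standard normalized hypergraph Laplacian of Zhou et al., which is consistent with the quantities listed above, it can be computed as in the following sketch (uniform hyperedge weights are an assumption):

```python
import numpy as np

def hypergraph_laplacian(H, w=None, eps=1e-12):
    """Normalized hypergraph Laplacian
        L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
    (Zhou et al.'s form, assumed here), where Dv and De are the diagonal
    vertex- and hyperedge-degree matrices and W the hyperedge weights."""
    n_vertices, n_edges = H.shape
    if w is None:
        w = np.ones(n_edges)                  # uniform weights (assumption)
    d_v = H @ w                               # vertex degrees
    delta_e = H.sum(axis=0)                   # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d_v, eps)))
    De_inv = np.diag(1.0 / np.maximum(delta_e, eps))
    theta = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.eye(n_vertices) - theta
```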
Combining Equations (7) and (13), the final objective function of LHH is defined as Equation (14), where $\beta$ is a regularization parameter. In Equation (14), to learn high-quality binary codes, the first term learns the classifier from the binary codes, the second term minimizes the $\ell_{2,1}$-norm of the projection matrix to exploit its low-rankness and sparsity, and the third term preserves the intrinsic complex structure of the data via the hypergraph.
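Writing the hypergraph term as a trace over the Laplacian and the classifier with the factorized matrix $W = AC$, the overall objective can be sketched as follows (the orientation of $B$ and the exact constraint set are assumptions consistent with the description above):

```latex
% Sketch of the combined LHH-style objective (assumed form).
\min_{A,\,C,\,B}\;
\|Y - (AC)^{\top} B\|_F^2
+ \lambda\,\|AC\|_{2,1}
+ \beta\, \mathrm{tr}\!\big(B\, L_{\mathrm{hyper}}\, B^{\top}\big)
\quad\text{s.t.}\quad A^{\top}A = I,\;\; B \in \{-1,1\}^{l \times n}.
```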
3.3. Optimization Algorithm
Since Equation (14) is nonconvex, it is difficult to obtain a global solution. We therefore alternately solve the sub-problems with respect to the following variables.
(1) C-step: Update C by fixing A and B.
Algorithm 1 Curvilinear Search Algorithm Based on Cayley Transformation
Input: initial point A(0), matrices B and C, hash code length l
Output: A(t)
1: Initialize t = 0;
2: Repeat
3: Compute the gradient G(t) according to (18);
4: Generate the skew-symmetric matrix Q(t);
5: Compute the step size τt that satisfies the Armijo-Wolfe conditions [33] via the line search along the path defined by (19);
6: Set A(t+1) to the point on the path (19) given by the step size τt;
7: Update t = t + 1;
8: Until convergence
In this case, the objective function is simplified to Equation (15), which can be rewritten as Equation (16). Setting the derivative of Equation (16) with respect to C to zero yields the closed-form solution in Equation (17), where $D$ is a diagonal matrix arising from the $\ell_{2,1}$-norm term.
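A closed-form C-step of this kind can be sketched as follows, assuming $Y \in \mathbb{R}^{K \times n}$, $B \in \{-1,1\}^{l \times n}$, $A \in \mathbb{R}^{l \times r}$ with $A^{\top}A = I$, and $D$ the $\ell_{2,1}$ reweighting matrix; the exact normal equations in Equation (17) may differ.

```python
import numpy as np

def update_C(A, B, Y, D, lam):
    """Closed-form C-step sketch for
        min_C ||Y - (A C)^T B||_F^2 + lam * tr((A C)^T D (A C)).
    Setting the gradient to zero gives
        (A^T B B^T A + lam * A^T D A) C = A^T B Y^T.
    Shapes (assumed): A (l, r), B (l, n), Y (K, n), D (l, l)."""
    lhs = A.T @ B @ B.T @ A + lam * (A.T @ D @ A)
    rhs = A.T @ B @ Y.T
    return np.linalg.solve(lhs, rhs)          # C has shape (r, K)
```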
(2) A-step: Update A by fixing B and C.
It is hard to obtain an optimal solution of Equation (14) with respect to A due to the orthogonal constraint. Here, we apply gradient descent with a curvilinear search to seek a locally optimal solution.
First, we denote by $G$ the gradient of Equation (16) with respect to $A$; its expression is given in Equation (18). A skew-symmetric matrix $Q$ is then constructed from $G$ and the current point $A$, and the next point is determined by a Crank-Nicolson-like scheme along a curve parameterized by the step size $\tau$, as in Equation (19). This curve admits the closed-form expression in Equation (20).
Here, Equation (20) is called the Cayley transformation [33,40,41]. The line search terminates when τt satisfies the Armijo-Wolfe conditions. The algorithm for solving this sub-problem is summarized in Algorithm 1.
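One iteration of such a curvilinear update can be sketched as follows; the skew-symmetric construction $Q = GA^{\top} - AG^{\top}$ and the Cayley curve are the standard choices in this framework [33,40], while the simple backtracking line search below merely stands in for the Armijo-Wolfe search of Algorithm 1.

```python
import numpy as np

def cayley_step(A, G, objective, tau=1.0, shrink=0.5, max_backtracks=20):
    """One curvilinear-search step on the Stiefel manifold (A^T A = I).
    A: current point (l, r); G: Euclidean gradient of the objective at A.
    Uses the Cayley curve A(tau) = (I + tau/2 Q)^{-1} (I - tau/2 Q) A with
    Q = G A^T - A G^T, and a simple backtracking line search (a stand-in
    for the Armijo-Wolfe conditions)."""
    n_rows = A.shape[0]
    Q = G @ A.T - A @ G.T                     # skew-symmetric by construction
    f0 = objective(A)
    I = np.eye(n_rows)
    for _ in range(max_backtracks):
        A_new = np.linalg.solve(I + 0.5 * tau * Q, (I - 0.5 * tau * Q) @ A)
        if objective(A_new) < f0:             # sufficient-decrease check (simplified)
            return A_new
        tau *= shrink
    return A                                  # no improving step found
```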
(3) B-step: Update B by fixing A and C.
The objective function is then simplified to Equation (21).
The above problem is challenging due to the discrete constraint and has no closed-form solution. Inspired by recent studies in nonconvex optimization, we optimize Equation (21) with a proximal gradient method, which iteratively optimizes a surrogate function. In the $j$-th iteration, we define a surrogate function that is a local approximation of the objective at the current discrete point $B^{(j)}$. Given $B^{(j)}$, the next discrete point $B^{(j+1)}$ is obtained by minimizing this surrogate.
Note that the resulting matrix may include zero entries, in which case multiple solutions for $B^{(j+1)}$ exist; thus, we introduce a function to eliminate the zero entries. The update rule for $B$ is given in Equation (23) [42,43].
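To illustrate the flavor of such a discrete update, the sketch below takes one proximal-style step for the smooth part of Equation (21), binarizes the result with the sign function, and keeps the previous entry wherever the argument is exactly zero; the exact surrogate and step size of Equation (23) are assumptions.

```python
import numpy as np

def update_B(B_prev, W, Y, L, beta, step=1.0):
    """One proximal-style B-step sketch for
        min_B ||Y - W^T B||_F^2 + beta * tr(B L B^T),  B in {-1, 1}^{l x n}.
    Gradient of the smooth part: 2 W (W^T B - Y) + 2 beta * B L.
    Zero entries after the step keep their previous sign (the role of the
    zero-eliminating function mentioned above, as assumed here)."""
    grad = 2.0 * W @ (W.T @ B_prev - Y) + 2.0 * beta * (B_prev @ L)
    Z = B_prev - step * grad                  # gradient step on the relaxed problem
    B_new = np.sign(Z)
    B_new[B_new == 0] = B_prev[B_new == 0]    # eliminate zero entries
    return B_new
```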
Algorithm 2 Low-Rank Hypergraph Hashing
Input: label matrix Y, hash code length l, hyperedge number k
Output: A(t), B(t), C(t)
1: Initialize A(0), B(0), and C(0);
2: Initialize t = 0;
3: Repeat
4: C-step: Update C(t) using (17);
5: A-step: Update A(t) by Algorithm 1;
6: B-step: Update B(t) using (23);
7: Update t = t + 1;
8: Until convergence
The learning algorithm of LHH is shown in Algorithm 2.
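A high-level sketch of the alternating scheme in Algorithm 2, reusing the hypothetical helpers sketched above (l21_reweight_diag, update_C, cayley_step, update_B), might look as follows; the initialization and stopping rule are assumptions.

```python
import numpy as np

def lhh_train(Y, L, l, lam=1.0, beta=1.0, r=None, max_iter=50, tol=1e-4, seed=0):
    """Alternating optimization sketch for an LHH-style objective
        ||Y - (A C)^T B||_F^2 + lam ||A C||_{2,1} + beta tr(B L B^T),
    reusing the helper sketches above (assumed names, not the paper's code)."""
    rng = np.random.default_rng(seed)
    K, n = Y.shape
    r = r or min(l, K)
    A = np.linalg.qr(rng.standard_normal((l, r)))[0]   # orthogonal initialization
    B = np.sign(rng.standard_normal((l, n)))
    C = rng.standard_normal((r, K))
    prev_obj = np.inf
    for _ in range(max_iter):
        D = l21_reweight_diag(A @ C)                   # l2,1 reweighting (sketch)
        C = update_C(A, B, Y, D, lam)                  # C-step, closed form
        def obj_A(A_):                                 # objective as a function of A
            W_ = A_ @ C
            return (np.linalg.norm(Y - W_.T @ B) ** 2
                    + lam * np.sum(np.linalg.norm(W_, axis=1)))
        G = 2.0 * B @ (B.T @ A @ C - Y.T) @ C.T + 2.0 * lam * D @ A @ C @ C.T
        A = cayley_step(A, G, obj_A)                   # A-step (Algorithm 1 stand-in)
        B = update_B(B, A @ C, Y, L, beta)             # B-step
        obj = obj_A(A) + beta * np.trace(B @ L @ B.T)
        if abs(prev_obj - obj) < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj
    return A, B, C
```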
3.5. Convergence Analysis and Computational Complexity Analysis
Firstly, we discuss the convergence of LHH, which is summarized in the following theorem.
Theorem 1: The alternating iteration scheme of Algorithm 2 monotonically reduces the objective function value of Equation (14), and Algorithm 2 converges to a local minimum of Equation (14).
Proof: LHH consists of three sub-problems. The sub-problem with respect to C is convex and thus attains its optimal solution. The sub-problems with respect to A and B are non-convex, but both the A-step and the B-step decrease the objective function value. Hence, Algorithm 2 decreases the objective function value at each step. In addition, the objective function value is non-negative and therefore bounded below, so the monotonically decreasing sequence of objective values converges. Thus, Algorithm 2 converges to a locally optimal solution of LHH.
Then, we analyze the computational complexity of the proposed LHH method, which mainly consists of the following parts. In the step of updating C, the cost is dominated by computing the closed-form solution in Equation (17). In the step of updating A, due to the orthogonal constraint, we use the Cayley transformation: each inner iteration requires computing the gradient in Equation (18) and updating A via Equation (20) [40], so the total cost of optimizing A is proportional to the number of inner iterations. The step of updating B is the most time-consuming part, as it involves computing the hypergraph Laplacian matrix. In summary, the total computational complexity of LHH grows linearly with the number of iterations of Algorithm 2. Finally, learning the hashing mapping matrix incurs an additional one-time training cost, and encoding a query at test time only requires applying this mapping.