Article

Stochastic Recursive Gradient Support Pursuit and Its Sparse Representation Applications

Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
*
Author to whom correspondence should be addressed.
Sensors 2020, 20(17), 4902; https://doi.org/10.3390/s20174902
Submission received: 7 July 2020 / Revised: 18 August 2020 / Accepted: 27 August 2020 / Published: 30 August 2020
(This article belongs to the Section Sensing and Imaging)

Abstract

In recent years, a series of matching pursuit and hard thresholding algorithms have been proposed to solve the sparse representation problem with an $\ell_0$-norm constraint. In addition, some stochastic hard thresholding methods have also been proposed, such as stochastic gradient hard thresholding (SG-HT) and stochastic variance reduced gradient hard thresholding (SVRGHT). However, each iteration of all these algorithms requires one hard thresholding operation, which leads to a high per-iteration complexity and slow convergence, especially for high-dimensional problems. To address this issue, we propose a new stochastic recursive gradient support pursuit (SRGSP) algorithm, in which only one hard thresholding operation is required in each outer-iteration. Thus, SRGSP has a significantly lower computational complexity than existing methods such as SG-HT and SVRGHT. Moreover, we provide a convergence analysis of SRGSP, which shows that SRGSP attains a linear convergence rate. Our experimental results on large-scale synthetic and real-world datasets verify that SRGSP outperforms state-of-the-art related methods for tackling various sparse representation problems. Moreover, we conduct extensive experiments on two real-world sparse representation applications, image denoising and face recognition, and all the results validate that our SRGSP algorithm obtains much better performance than other sparse representation learning optimization methods in terms of PSNR and recognition rate.

1. Introduction

In recent years, sparse representation has proved to be a useful approach for representing or compressing high-dimensional signals. Sparse representation algorithms have attracted many researchers in the fields of signal processing, image processing, medical imaging, machine learning, computer vision, pattern recognition, and so on [1]. In most applications, the unknown signal of interest is regarded as a sparse combination of a few columns from a given dictionary, and this problem is usually formulated as a sparsity constrained problem. Such sparse representation problems are common in image denoising, image inpainting, face recognition, and related areas [2,3,4].
Image denoising is a classical problem in computer vision whose goal is to improve image quality. The aim is to recover the clean image $\mathbf{x}$ from the noisy image $\mathbf{y} = \mathbf{x} + \mathbf{e}$, where $\mathbf{e}$ is in general additive white Gaussian noise [5]. It can be addressed by three main types of methods: transform-domain methods [6], spatial filtering [7], and dictionary learning-based methods [8,9]. Note that the dictionary learning-based methods optimize the following model in the $\ell_0$-norm sparse coding step:
$$\min_{\mathbf{x}\in\mathbb{R}^d} \|\mathbf{x}\|_0, \quad \mathrm{s.t.},\ \|\mathbf{y}-\mathbf{D}\mathbf{x}\|_2^2 \le \varepsilon, \tag{1}$$
where $\mathbf{x}$ denotes the sparse code of $\mathbf{y}\in\mathbb{R}^n$ with a given error tolerance $\varepsilon > 0$, $\mathbf{D}\in\mathbb{R}^{n\times d}$ is a given dictionary, $\|\mathbf{x}\|_0$ denotes the number of nonzero entries of the vector $\mathbf{x}$, and $\|\mathbf{x}\|_2$ is the $\ell_2$-norm, i.e., $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$. This is an $\ell_0$-norm constrained minimization problem, which can be solved via convex relaxation or greedy algorithms. The performance of the algorithm used to find a sparse representation solution has a large impact on the denoised image. Similar effects can be found in face recognition tasks.
As a challenging application in computer vision and pattern recognition, face recognition is complicated by variations in illumination, occlusion, expression, and facial disguise. Many face recognition methods have been proposed, e.g., Eigenfaces [10] and Fisherfaces [11]. Wright et al. [12] proposed a sparse representation-based classification (SRC) method, which regards the training face images themselves as an overcomplete dictionary. A test face image can then be represented by a few atoms of this overcomplete dictionary, so we again need to solve a sparse coding problem similar to Equation (1). Hence, the same computational question arises as in image denoising. In this paper, our main aim is to find a more efficient algorithm for sparse representation in image denoising and face recognition tasks.

1.1. Stochastic Hard Thresholding Methods

So far, there have been many algorithms for pursuing sparse solutions via $\ell_p$-norm ($0 \le p \le 1$) minimization [1]. Although convex relaxation algorithms can easily find an optimal solution thanks to the convexity of the $\ell_1$-norm, $\ell_1$-norm minimization has some limitations, as pointed out in [13], and sometimes yields worse empirical performance [14]. Thus, it is necessary to solve the $\ell_0$-norm problem directly. In order to obtain sparse solutions, many greedy algorithms [15,16,17,18,19] have been developed. Moreover, there are some hard thresholding-based methods, such as iterative hard thresholding [20], fast gradient hard thresholding pursuit [21], and gradient support pursuit (GraSP) [22]. All these methods have been applied successfully to various real-world problems such as sparse vector and low-rank matrix recovery. However, the hard thresholding methods are deterministic optimization algorithms: they need to compute a full gradient over all training samples and thus have a high per-iteration complexity of $O(nd)$, which makes them unsuitable for real-world large-scale data.
To address this issue, Nguyen et al. [23] proposed a stochastic gradient hard thresholding (SG-HT) algorithm, introducing the idea of stochastic optimization into hard thresholding methods. It randomly selects one sample per iteration and therefore has a much lower per-iteration complexity. However, SG-HT cannot reduce the variance of the stochastic gradient relative to the full gradient. Li et al. [24] proposed a stochastic variance reduced gradient hard thresholding (SVRGHT) method, which uses the stochastic variance reduced gradient (SVRG) technique [25], as in [26]. With the help of variance reduction techniques, SVRGHT obtains a faster convergence rate. More recently, several stochastic hard thresholding algorithms using first-order or second-order information have been proposed [27,28,29,30,31,32,33]. However, many stochastic algorithms such as SVRGHT perform a hard thresholding operation in each iteration, whose computational complexity is relatively high, $O(d\log(d))$ in general [34], especially for high-dimensional data. In addition, there are two main drawbacks of these thresholding methods. The first concerns the theoretical basis of the optimization: when the current iterate is not a minimizer of the function, moving from this point in the direction of the negative gradient decreases the function value. However, this is no longer true in general once the hard thresholding operator $\mathcal{H}_s(\cdot)$ is applied to the current vector $\mathbf{x}^t$, because part of the gradient information is lost. This destroys information carried by the current solution and may waste much of the computation spent on gradient descent. Second, the computational burden of each hard thresholding operation is still linear in $d$, which is not negligible. An interesting question is whether there is an algorithm that overcomes these drawbacks. We answer this question affirmatively, in theory and in practice.

1.2. Our Contributions

In this paper, we propose the first variance reduced stochastic recursive gradient method for sparse representation problems. In other words, we use the stochastic recursive gradient proposed in [35], which is suitable for solving non-convex problems, to optimize the non-convex sparse representation problem considered here. In order to keep the gradient information of the current iterate, as suggested in [36], we perform many stochastic gradient descent steps followed by a single hard thresholding operation. We also construct the most relevant support, over which the minimization is most effective. Therefore, this paper proposes a novel sparsity-constrained algorithm, called stochastic recursive gradient support pursuit (SRGSP). At each iteration of SRGSP, we first find the most relevant support set, approximately minimize over this support set with our stochastic recursive gradient solver, which satisfies a certain descent condition, and then perform hard thresholding on the updated model parameter. The main contributions and novelty of this paper are listed as follows:
(1)
We analyze the statistical estimation performance of SRGSP under mild assumptions, which is non-trivial, and the theoretical results show that SRGSP attains a fast linear convergence rate.
(2)
Benefiting from fewer hard thresholding operations than existing algorithms such as SVRGHT, the average per-iteration cost of our algorithm is much lower ($O(d)$ for SRGSP vs. $O(d\log(d))$ for SVRGHT), which leads to faster convergence.
(3)
Moreover, applying the hard thresholding operator to the current variable less frequently retains more gradient information, which improves empirical performance. Stochastic recursive gradient support pursuit thus points to a new way of reducing the cost of hard thresholding operations while maintaining or even improving performance.
(4)
We also evaluate the empirical performance of our SRGSP method on sparse linear and logistic regression tasks as well as real-world applications such as image denoising and face recognition. Our experimental results show the efficiency and effectiveness of SRGSP.
The remainder of this paper is organized as follows. In Section 2, we introduce the related applications (i.e., image denoising and face recognition), and we propose our SRGSP algorithm in Section 3. The convergence analysis is provided in Section 4. In Section 5, many experimental results on both synthetic and real-world datasets verify the effectiveness of SRGSP, and the results of image denoising and face recognition further demonstrate the superiority of SRGSP against some state-of-the-art hard thresholding algorithms. Section 6 presents conclusions and future work.

2. Related Work

In this section, we start with a brief description of some related applications, in which sparse representation can play an important role.

2.1. Notation

In this paper, $\|\mathbf{x}\|_0$ denotes the number of nonzero entries of the vector $\mathbf{x}$, $\mathrm{supp}(\mathbf{x})$ denotes the index set of the nonzero entries of $\mathbf{x}$, and $\mathrm{supp}(\mathbf{x}, s)$ is the index set of the top $s$ entries of $\mathbf{x}$ in terms of magnitude. In addition, we denote by $\mathcal{I}^c$ the complement of an index set $\mathcal{I}$ and by $\mathbf{x}|_{\mathcal{I}}$ the restriction of the vector $\mathbf{x}$ to the coordinates indexed by $\mathcal{I}$. Furthermore, $\mathbf{H}_F(\cdot)$ denotes the Hessian matrix of the function $F(\cdot)$, and $\mathbb{E}(\cdot)$ denotes the expectation.
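For readers who prefer code, the following is a minimal NumPy sketch of these operators (the function names are ours; the restriction is implemented here by zeroing out the coordinates outside the index set, which is how it is used later in Algorithm 1):

```python
import numpy as np

def supp(x, s=None):
    """Index set of the nonzero entries of x, or of its top-s entries by magnitude."""
    if s is None:
        return np.flatnonzero(x)
    return np.argsort(np.abs(x))[::-1][:s]

def hard_threshold(x, s):
    """H_s(x): keep the s largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = supp(x, s)
    out[idx] = x[idx]
    return out

def restrict(x, I):
    """x|_I: keep the coordinates of x indexed by I and zero out the rest."""
    out = np.zeros_like(x)
    out[I] = x[I]
    return out
```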

2.2. Sparse Representation-Based Image Denoising

In sparse representation, a clean image or signal can be approximated by a sparse linear combination of atoms from a basis set, called a dictionary. In this setting, denoising a patch vector $\mathbf{y}^j$, extracted from the noisy image matrix, with a dictionary $\mathbf{D}\in\mathbb{R}^{n\times d}$ amounts to solving the following sparsity constrained optimization problem:
$$\min_{\mathbf{x}} F(\mathbf{x}) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \big\|y_i^j - \mathbf{D}_i\mathbf{x}\big\|_2^2, \quad \mathrm{s.t.},\ \|\mathbf{x}\|_0 \le s, \tag{2}$$
where $\mathbf{D}_i\mathbf{x}$ is an estimate of $y_i^j$ (with $\mathbf{D}_i$ the $i$-th row of $\mathbf{D}$), $s$ is a sparsity constant, and $\mathbf{y}^j$ is the $j$-th patch of the noisy image $\mathbf{y}$. There are many dictionary learning algorithms, such as [8,37,38], which alternately update the dictionary and the learned sparse iterate $\mathbf{x}$. Although these algorithms have demonstrated that dictionaries learned on noisy images, or on a set of good-quality images, can achieve better performance than off-the-shelf ones such as [9], we here use a fixed overcomplete dictionary in order to isolate the behavior of our algorithm in sparse coding. The overcomplete dictionary $\mathbf{D}$ (i.e., one whose number of columns may be greater than its number of rows) can be obtained from the discrete cosine transform (DCT) [39] or its redundant version, as implemented in [38]. Since images may be very large, it is common practice to sparsely represent image patches rather than the full image.
In summary, we obtain an overcomplete dictionary matrix $\mathbf{D}$ by the DCT and then use $\ell_0$-norm constrained optimization algorithms to find an approximate solution of Equation (2) in order to restore the image.
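For concreteness, the following is a minimal NumPy sketch of one common way to build a redundant (overcomplete) 2-D DCT dictionary of size $64\times 256$ for $8\times 8$ patches, as used in the experiments of Section 5.4. This is only an illustrative construction (the function name and parameter choices are ours), not necessarily the authors' exact implementation:

```python
import numpy as np

def overcomplete_dct_dictionary(patch_size=8, n_atoms_1d=16):
    """Redundant 2-D DCT dictionary of size (patch_size^2) x (n_atoms_1d^2),
    e.g., 64 x 256 for 8x8 patches and 16 one-dimensional atoms."""
    # 1-D overcomplete DCT basis: patch_size x n_atoms_1d
    D1 = np.zeros((patch_size, n_atoms_1d))
    for k in range(n_atoms_1d):
        atom = np.cos(np.arange(patch_size) * k * np.pi / n_atoms_1d)
        if k > 0:
            atom -= atom.mean()          # remove the DC component of non-constant atoms
        D1[:, k] = atom / np.linalg.norm(atom)
    # 2-D dictionary as the Kronecker product of the 1-D basis with itself
    return np.kron(D1, D1)               # shape: (64, 256)

D = overcomplete_dct_dictionary()
print(D.shape)                           # (64, 256)
```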

2.3. Sparse Representation-Based Face Recognition

Face recognition is an active research field in computer vision; the task is to use labeled training samples from $k$ classes to classify a test sample into the correct class. In this paper, we plug our algorithm into the SRC framework [12] for face recognition. Since SRC uses $\ell_1$-norm minimization to solve the sparse coding model, in this paper we use $\ell_0$-norm minimization instead, which also yields a sparse solution; the resulting algorithm is provided in Appendix A. In the SRC algorithm, an $l \times h$ gray facial training image is reshaped into a column vector $\mathbf{a}_{r,u}\in\mathbb{R}^n$, i.e., $n = lh$. Then we construct the matrix $\mathbf{A}_r = [\mathbf{a}_{r,1}, \mathbf{a}_{r,2}, \ldots, \mathbf{a}_{r,d_r}] \in \mathbb{R}^{n\times d_r}$ from the $d_r$ training samples belonging to the $r$-th class. Each test sample $\mathbf{y}^r = [y_1^r, \ldots, y_n^r]^T \in \mathbb{R}^{n}$ from the same class can then be linearly represented by the columns of $\mathbf{A}_r$:
$$\mathbf{y}^r = \mathbf{a}_{r,1}\, x_{r,1} + \mathbf{a}_{r,2}\, x_{r,2} + \cdots + \mathbf{a}_{r,d_r}\, x_{r,d_r}. \tag{3}$$
Here, $x_{r,1}, x_{r,2}, \ldots, x_{r,d_r}$ are scalars, namely the representation coefficients of $\mathbf{y}^r$. Since the class of the test sample is unknown, we consider the training samples of all $k$ classes and define the matrix $\mathbf{A} = [\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_k] \in \mathbb{R}^{n\times d}$. Therefore, the representation of a test sample can be rewritten with respect to all the training samples as:
$$\mathbf{y}^r = \mathbf{A}\mathbf{x}_0, \tag{4}$$
where $\mathbf{x}_0$ is a coefficient vector whose nonzero elements are associated only with the $r$-th class. In this paper, sparse representation with $\ell_0$-norm minimization is used to solve the following sparsity constrained optimization problem in the SRC framework:
$$\min_{\mathbf{x}} F(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \big\|y_i^r - \mathbf{A}_{\mathrm{row}_i}\mathbf{x}\big\|_2^2, \quad \mathrm{s.t.},\ \|\mathbf{x}\|_0 \le s, \tag{5}$$
where $s$ is the sparsity constant, which bounds the number of nonzero elements of $\mathbf{x}$, and $\mathbf{A}_{\mathrm{row}_i}\in\mathbb{R}^{1\times d}$ is the $i$-th row vector of $\mathbf{A}$. Let $\delta_p: \mathbb{R}^d \to \mathbb{R}^d$ be the selection function corresponding to the $p$-th class: given a sparse vector $\mathbf{x}\in\mathbb{R}^d$ from Equation (5), $\delta_p(\mathbf{x})\in\mathbb{R}^d$ is a new vector whose nonzero elements are associated only with the $p$-th class. Then minimizing the following residual encodes the identity of the sample $\mathbf{y}^r$:
$$\mathrm{identity}(\mathbf{y}^r) = \arg\min_{p} R_p(\mathbf{y}) := \big\|\mathbf{y}^r - \mathbf{A}\,\delta_p(\mathbf{x})\big\|_2. \tag{6}$$
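To make the class-wise residual rule in Equation (6) concrete, here is a minimal NumPy sketch. It assumes that a sparse code `x` has already been obtained by some $\ell_0$-constrained solver and that `class_index[i]` gives the class label of the $i$-th column of the training matrix; both names, and the function itself, are ours:

```python
import numpy as np

def src_identity(y, A, x, class_index, k):
    """Assign y to the class whose training columns best reconstruct it.

    y: test sample (n,), A: training matrix (n, d), x: sparse code (d,),
    class_index: length-d array with the class label of each column of A.
    """
    residuals = np.empty(k)
    for p in range(k):
        # delta_p(x): keep only the coefficients associated with class p
        delta_p_x = np.where(class_index == p, x, 0.0)
        residuals[p] = np.linalg.norm(y - A @ delta_p_x)
    return int(np.argmin(residuals))
```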

3. Our Stochastic Recursive Gradient Support Pursuit Method

In this section, we propose a novel stochastic recursive gradient support pursuit (SRGSP) method for sparsity constrained problems. Different from existing gradient support pursuit methods (e.g., GraSP [22]), SRGSP only requires its sub-solver to satisfy a certain contractive condition in each iteration, and thus converges faster in practice.
In recent years, many non-convex gradient support pursuit methods such as [20,22] have been proposed, and it has been shown that they can outperform convex $\ell_1$-norm methods in certain circumstances. Most of the existing gradient support pursuit algorithms use deterministic optimization methods to minimize various sparse learning problems (e.g., Problem (2)). However, the per-iteration complexity of all these algorithms is $O(nd)$, which leads to slow convergence, especially for large-scale and high-dimensional problems. Inspired by GraSP [22], a well-known gradient support pursuit method, we propose an efficient stochastic recursive gradient support pursuit (SRGSP) algorithm to approximate the solution of Problem (2), as outlined in Algorithm 1.
Algorithm 1: Stochastic Recursive Gradient Support Pursuit (SRGSP)
Input: Sparsity level $s$, learning rate $\eta$, and the numbers of outer-iterations and inner-iterations, $T$ and $J$.
Initialize: $\widehat{\mathbf{x}}^0$.
1: for $t = 1, 2, \ldots, T$ do
2:  Compute the current gradient: $\mathbf{g}_0 = \nabla F(\widehat{\mathbf{x}}^{t-1})$;
3:  Identify directions: $\mathcal{Z} = \mathrm{supp}(\mathbf{g}_0, 2s)$;
4:  Merge supports: $\mathcal{T} = \mathcal{Z} \cup \mathrm{supp}(\widehat{\mathbf{x}}^{t-1})$;
5:  Initialization: $\mathbf{z}_0 = \widehat{\mathbf{x}}^{t-1}$, $\mathbf{z}_1 = \mathbf{z}_0 - \eta\,\mathbf{g}_0$;
6:  for $j = 1, 2, \ldots, J$ do
7:   Randomly pick $i_j \in \{1, 2, \ldots, n\}$;
8:   $\mathbf{g}_j = \nabla f_{i_j}(\mathbf{z}_j) - \nabla f_{i_j}(\mathbf{z}_{j-1}) + \mathbf{g}_{j-1}$;
9:   $\mathbf{z}_{j+1} = \mathbf{z}_j - \eta\,\mathbf{g}_j$;
10:  end for
11:  Perform hard thresholding over $\mathcal{T}$: $\widehat{\mathbf{x}}^t = \mathcal{H}_s(\mathbf{z}_{J+1}|_{\mathcal{T}})$;
12: end for
Output: $\widehat{\mathbf{x}}^T$.
At each iteration of Algorithm 1, we first compute the gradient of $F(\cdot)$ at the current estimate, i.e., $\mathbf{g}_0 = \nabla F(\widehat{\mathbf{x}}^{t-1})$. Then we choose the $2s$ coordinates of $\mathbf{g}_0$ with the largest magnitude as the directions in which pursuing the minimization will be most effective, and denote their indices by $\mathcal{Z}$, where $s$ is the sparsity constant. Merging the support of the current estimate with the $2s$ coordinates mentioned above, we obtain the combined support, a set of at most $3s$ indices, i.e., $\mathcal{T} = \mathcal{Z} \cup \mathrm{supp}(\widehat{\mathbf{x}}^{t-1})$ (some of the notation used here has already been defined in Section 2). Over the current support set $\mathcal{T}$, we compute an estimate $\mathbf{b}$ by stochastic recursive gradient descent as an approximate solution to Problem (7).
The key difference between GraSP and SRGSP is that GraSP needs to compute the exact solution $\widehat{\mathbf{b}}$ of the following minimization problem:
$$\min_{\mathbf{x}} F(\mathbf{x}), \quad \mathrm{s.t.},\ \mathbf{x}|_{\mathcal{T}^c} = \mathbf{0}, \tag{7}$$
where $\mathcal{T}^c$ is the complement of the set $\mathcal{T}$ defined in line 4 of Algorithm 1, while our SRGSP method only requires a sub-solver (e.g., Steps 5 to 10 in Algorithm 1) that finds an approximate solution $\mathbf{b}$ to Problem (7) satisfying
$$\|\mathbf{b} - \widehat{\mathbf{b}}\|_2 \le c_1 \|\widehat{\mathbf{x}}^{t-1} - \widehat{\mathbf{b}}\|_2, \tag{8}$$
where $\widehat{\mathbf{b}}$ is the exact solution to Problem (7), $\widehat{\mathbf{x}}^{t-1}$ is the result of the last outer-iteration, and $0 < c_1 < 1$ is an error-bound constant; this means that our algorithm has a guaranteed decrease at each iteration, as shown in the convergence analysis in the next section. In other words, other efficient solvers (e.g., SVRG [25], VR-SGD [32], and their accelerated variants [40,41]) can be plugged into the proposed framework, as long as they satisfy the contractive condition in Equation (8). Since stochastic recursive gradient descent has been proved in [42,43] to have a faster convergence rate than other stochastic gradient estimators such as SVRG [25] for solving non-convex optimization problems, we choose it as our solver rather than SVRG, as used in [24]. When a fully deterministic optimization method is used as the sub-solver in Algorithm 1 for solving Problem (7), GraSP can be viewed as a special case of SRGSP.
In our experiments, we usually set the number of inner-iterations to $J = 2n$, similar to the original SARAH algorithm [35]. Within each inner loop of Algorithm 1, the main update rules are as follows:
$$\mathbf{g}_j = \nabla f_{i_j}(\mathbf{z}_j) - \nabla f_{i_j}(\mathbf{z}_{j-1}) + \mathbf{g}_{j-1},$$
$$\mathbf{z}_{j+1} = \mathbf{z}_j - \eta\,\mathbf{g}_j.$$
Note that $\mathbf{g}_j$ is the stochastic recursive gradient first proposed in [35]. That is, our SRGSP algorithm updates $\mathbf{g}_j$ using accumulated stochastic information, which naturally accelerates convergence. The parameter $\widehat{\mathbf{x}}^t$ is then updated using the hard thresholding operator, which keeps the $s$ largest-magnitude entries of the intermediate estimate $\mathbf{b}$. This step makes $\widehat{\mathbf{x}}^t$ the best $s$-term approximation of the estimate $\mathbf{b}$. The hard thresholding operator is defined as follows:
$$[\mathcal{H}_s(\mathbf{x})]_i = \begin{cases} x_i, & \text{if } i \in \mathrm{supp}(\mathbf{x}, s), \\ 0, & \text{otherwise}, \end{cases}$$
where $x_i$ is the $i$-th coordinate of the vector $\mathbf{x}$.
Assumption 1.
The solution to the sub-problem (7) is unique.
From the above analysis, we can see that our SRGSP algorithm performs a single hard thresholding operation after many stochastic recursive gradient updates, while existing stochastic algorithms such as SVRGHT [24] perform hard thresholding in every inner-iteration, which is very time-consuming for high-dimensional problems.
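To fix ideas, the following is a minimal NumPy sketch of Algorithm 1 for the sparse least-squares objective $F(\mathbf{x}) = \frac{1}{n}\sum_i (y_i - \mathbf{a}_i^T\mathbf{x})^2$. The function name, step size, and loop counts are ours and are illustrative only, not the tuned values used in the experiments:

```python
import numpy as np

def srgsp_least_squares(A, y, s, eta=0.01, T=20, J=None, seed=0):
    """Sketch of SRGSP (Algorithm 1) for F(x) = (1/n) * sum_i (y_i - a_i^T x)^2
    under the constraint ||x||_0 <= s."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    J = 2 * n if J is None else J                            # inner-loop length, as in the paper
    x_hat = np.zeros(d)

    grad_full = lambda x: (2.0 / n) * A.T @ (A @ x - y)      # full gradient of F
    grad_one = lambda x, i: 2.0 * A[i] * (A[i] @ x - y[i])   # gradient of f_i

    for _ in range(T):
        g = grad_full(x_hat)                                  # Step 2
        Z = np.argsort(np.abs(g))[::-1][:2 * s]               # Step 3: top-2s coordinates of g
        T_set = np.union1d(Z, np.flatnonzero(x_hat))          # Step 4: merge supports
        z_prev, z = x_hat, x_hat - eta * g                    # Step 5
        for _ in range(J):                                    # Steps 6-10: SARAH-style updates
            i = rng.integers(n)
            g = grad_one(z, i) - grad_one(z_prev, i) + g      # stochastic recursive gradient
            z_prev, z = z, z - eta * g
        b = np.zeros(d)
        b[T_set] = z[T_set]                                   # restrict the estimate to T
        top = T_set[np.argsort(np.abs(b[T_set]))[::-1][:s]]
        x_hat = np.zeros(d)                                   # Step 11: hard thresholding H_s
        x_hat[top] = b[top]
    return x_hat
```

In this sketch, each outer-iteration costs one full gradient, $J$ stochastic gradients, and a single hard thresholding restricted to at most $3s$ coordinates, which is where the $O(d)$ vs. $O(d\log(d))$ saving over SVRGHT comes from.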

4. Convergence Analysis

In this section, we provide the convergence analysis of our SRGSP algorithm.

4.1. Convergence Property of Our Sub-solver

In this part, we consider the convergence property of the sub-solver in Algorithm 1; that is, we show that our sub-solver can satisfy the descent condition in Equation (8). As most algorithms available in the community provide a bound on $F(\mathbf{b}) - F(\widehat{\mathbf{b}})$, our convergence analysis requires additional structure in $F(\cdot)$ in order to obtain a bound on $\|\mathbf{b} - \widehat{\mathbf{b}}\|_2$. Therefore, we introduce the following insightful summary of such structures of $F(\cdot)$ [44].
Lemma 1.
Let $F(\cdot)$ be a function with a Lipschitz-continuous gradient. Then the following implications hold:
$$(SC) \Rightarrow (ESC) \Rightarrow (WSC) \Rightarrow (PL) \Rightarrow (QG),$$
where SC means Strong Convexity, ESC means Essential Strong Convexity, WSC means Weak Strong Convexity, PL means Polyak–Łojasiewicz, and QG means Quadratic Growth. For their definitions, we refer the reader to [44]. If we further assume that $F(\cdot)$ is convex, then we have $(PL) \Leftrightarrow (QG)$.
These results show that QG is the weakest assumption. Next, we prove that our sub-solver satisfies the descent condition in Equation (8).
Theorem 1.
Suppose $F(\cdot)$ satisfies the QG condition with parameter $\rho$ and is Lipschitz continuous with parameter $L$. Assume that the number of inner-iterations, $J$, is sufficiently large. Then our sub-solver has the following expected convergence property:
$$\mathbb{E}[F(\mathbf{b}) - F(\widehat{\mathbf{b}})] \le c_2\,[F(\widehat{\mathbf{x}}^{t-1}) - F(\widehat{\mathbf{b}})], \tag{9}$$
where $0 < c_2 < 1$ is a constant, and consequently
$$\|\mathbf{b} - \widehat{\mathbf{b}}\|_2 \le \sqrt{\frac{2 c_2 L}{\rho}}\, \|\widehat{\mathbf{x}}^{t-1} - \widehat{\mathbf{b}}\|_2. \tag{10}$$
The detailed proofs of Theorem 1 and of the theorem below are provided in the Supplementary Material. Similar to the linear convergence analysis of SARAH for convex problems in [35], our sub-solver exhibits expected descent in the objective function value, as shown in Equation (9). If the sub-solver runs a sufficiently large number of inner-iterations, $c_2$ can be made a very small constant. According to Theorem 1, one can easily verify that our sub-solver satisfies the contractive condition in Equation (8) when $\sqrt{2 c_2 L / \rho} \le c_1$. That is, our sub-solver with a sufficiently large number of inner-iterations satisfies the contractive condition required by Algorithm 1.
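For intuition, here is a brief sketch of how the function-value bound in Equation (9) is turned into the iterate bound in Equation (10). It assumes the QG condition in the form $F(\mathbf{b}) - F(\widehat{\mathbf{b}}) \ge \frac{\rho}{2}\|\mathbf{b}-\widehat{\mathbf{b}}\|_2^2$ and a Lipschitz-type upper bound $F(\widehat{\mathbf{x}}^{t-1}) - F(\widehat{\mathbf{b}}) \le L\,\|\widehat{\mathbf{x}}^{t-1}-\widehat{\mathbf{b}}\|_2^2$ over the support $\mathcal{T}$; the exact constants depend on how $\rho$ and $L$ are normalized in the Supplementary Material:
$$\frac{\rho}{2}\,\mathbb{E}\|\mathbf{b}-\widehat{\mathbf{b}}\|_2^2 \;\le\; \mathbb{E}[F(\mathbf{b})-F(\widehat{\mathbf{b}})] \;\le\; c_2\,[F(\widehat{\mathbf{x}}^{t-1})-F(\widehat{\mathbf{b}})] \;\le\; c_2 L\,\|\widehat{\mathbf{x}}^{t-1}-\widehat{\mathbf{b}}\|_2^2,$$
so the ratio $\|\mathbf{b}-\widehat{\mathbf{b}}\|_2 / \|\widehat{\mathbf{x}}^{t-1}-\widehat{\mathbf{b}}\|_2$ is controlled (in expectation) by $\sqrt{2 c_2 L/\rho}$, which can be driven below $c_1$ by making $J$ large enough, i.e., by making $c_2$ small enough.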

4.2. Convergence Property of SRGSP

Before giving our main convergence result, we first present some important definitions.
Definition 1
(Stable Restricted Hessian). Suppose that $F(\cdot)$ is a twice continuously differentiable function, and its Hessian matrix is denoted by $\mathbf{H}_F(\cdot)$. For a given positive integer $k$, let
$$A_k(\mathbf{u}) = \sup_{|\mathrm{supp}(\mathbf{u}) \cup \mathrm{supp}(\mathbf{v})| \le k,\ \|\mathbf{v}\|_2 = 1} \mathbf{v}^{T} \mathbf{H}_F(\mathbf{u})\, \mathbf{v},$$
$$B_k(\mathbf{u}) = \inf_{|\mathrm{supp}(\mathbf{u}) \cup \mathrm{supp}(\mathbf{v})| \le k,\ \|\mathbf{v}\|_2 = 1} \mathbf{v}^{T} \mathbf{H}_F(\mathbf{u})\, \mathbf{v},$$
for all $k$-sparse vectors $\mathbf{u}$. Then $F(\cdot)$ is said to have a Stable Restricted Hessian (SRH) with constant $\mu_k$, or in short $\mu_k$-SRH, if $1 \le A_k(\mathbf{u})/B_k(\mathbf{u}) \le \mu_k$.
This definition shows that the SRH condition is similar to various forms of Restricted Strong Convexity (RSC) used in the performance analysis of existing sparsity constrained algorithms [22]. Note that this property is suited to smooth loss functions, and a broad family of loss functions have Lipschitz-continuous gradients.
Theorem 2.
Let $F(\cdot)$ be a twice continuously differentiable function that has $\mu_{4s}$-SRH with $\mu_{4s} < \sqrt{2}$ and satisfies Assumption 1. Suppose that for some $\epsilon > 0$ we have $\epsilon < B_{4s}(\mathbf{u})$ for all $4s$-sparse $\mathbf{u}$, and let $\{\widehat{\mathbf{x}}^t\}$ be the sequence generated by Algorithm 1. Then we have
$$\|\widehat{\mathbf{x}}^t - \mathbf{x}^*\|_2 \le \delta^t\, \|\widehat{\mathbf{x}}^0 - \mathbf{x}^*\|_2 + \frac{(1-\delta^t)\,(1+c_1)\,(2\mu_{4s}+4)}{\epsilon\,(1-\delta)}\, \big\|\nabla F(\mathbf{x}^*)|_{\mathcal{I}}\big\|_2,$$
where $\delta := (1+c_1)(\mu_{4s}^2 - 1) + 2c_1 < 1$, and $\mathcal{I}$ is the index set of the $3s$ largest-magnitude entries of $\nabla F(\mathbf{x}^*)$.
As discussed above, our sub-solver with a sufficiently large number of inner-iterations can satisfy the contractive condition in Equation (8) with a very small constant $c_1$, which makes $\delta < 1$ hold; equivalently, we need $0 < c_1 < \frac{2 - \mu_{4s}^2}{1 + \mu_{4s}^2}$. This implies that the sub-solver has to achieve a certain accuracy for the theorem to apply. Theorem 2 shows that our proposed algorithm achieves a linear convergence rate. The error bound consists of two terms, where the first term corresponds to the optimization error and the second term corresponds to the statistical error. After sufficiently many iterations, the first term approaches zero, so our algorithm converges to the unknown true parameter $\mathbf{x}^*$ up to the statistical error as the number of iterations increases.
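For completeness, the admissible range of $c_1$ is simply the condition $\delta < 1$ rearranged (using $\delta$ as defined in Theorem 2):
$$(1+c_1)(\mu_{4s}^2 - 1) + 2c_1 < 1 \;\Longleftrightarrow\; c_1\,(\mu_{4s}^2 + 1) < 2 - \mu_{4s}^2 \;\Longleftrightarrow\; c_1 < \frac{2 - \mu_{4s}^2}{1 + \mu_{4s}^2},$$
which also explains the requirement $\mu_{4s} < \sqrt{2}$: otherwise the right-hand side is non-positive and no admissible $c_1$ exists.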

5. Experimental Results

In this section, we evaluate the performance of our SRGSP method on synthetic and real-world large-scale datasets. Moreover, we apply SRGSP to tackle various sparse representation problems, including image denoising and face recognition tasks. In this work, we use these two real-world applications to illustrate the excellent performance of our SRGSP algorithm against other sparse learning optimization methods, including GraSP [22], SG-HT [23], SVRGHT [24], and loopless semi-stochastic gradient descent with less hard thresholding (LSSG-HT) [34].

5.1. Baseline Methods

We compared the proposed algorithm (i.e., SRGSP) with four state-of-the-art algorithms: gradient support pursuit (GraSP) [22], stochastic gradient descent with hard thresholding (SG-HT) [23], stochastic variance reduced gradient with hard thresholding (SVRGHT) [24] and loopless semi-stochastic gradient descent with less hard thresholding (LSSG-HT) [34].

5.2. Synthetic Data

We generated a synthetic matrix $\mathbf{A}$ of size $n \times d$, each row of which was drawn independently from a $d$-dimensional Gaussian distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$. The response vector was generated from the linear model $\mathbf{y} = \mathbf{A}\mathbf{x}^* + \mathbf{e}$, where $\mathbf{x}^* \in \mathbb{R}^d$ is the $s^*$-sparse coefficient vector, and the noise $\mathbf{e}$ was generated from a multivariate normal distribution $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ with $\sigma^2 = 0.01$. The nonzero entries of $\mathbf{x}^*$ were sampled independently from a uniform distribution over the interval $[-1, 1]$. For the experiments, we constructed two synthetic datasets: (1) $n = 2500$, $d = 5000$, $s^* = 250$, $\Sigma = \mathbf{I}$; (2) $n = 5000$, $d = 10{,}000$, $s^* = 500$, with the diagonal entries of the covariance matrix $\Sigma$ set to 1 and the other entries set to 0.1. The sparsity parameter $s$ was set to $s = 1.2 s^*$ for all the algorithms.
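A minimal NumPy sketch of this data-generating process (the function name and the random-seed handling are ours; the exact sampling code used by the authors is not given in the paper):

```python
import numpy as np

def make_synthetic(n, d, s_star, rho=0.0, sigma2=0.01, seed=0):
    """Synthetic sparse linear regression data: y = A x* + e.

    rho is the common off-diagonal value of the covariance matrix
    (0.0 for the first setting above, 0.1 for the second)."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho)
    np.fill_diagonal(Sigma, 1.0)
    A = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    x_star = np.zeros(d)
    support = rng.choice(d, size=s_star, replace=False)
    x_star[support] = rng.uniform(-1.0, 1.0, size=s_star)    # nonzeros ~ U[-1, 1]
    e = rng.normal(0.0, np.sqrt(sigma2), size=n)              # Gaussian noise
    return A, A @ x_star + e, x_star

# e.g., the first setting: n = 2500, d = 5000, s* = 250, Sigma = I
# A, y, x_star = make_synthetic(2500, 5000, 250, rho=0.0)
```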
Figure 1 shows the performance (including the logarithm of the objective function value and the estimation error $\|\widehat{\mathbf{x}}^t - \mathbf{x}^*\|_2 / \|\mathbf{x}^*\|_2$) of all the algorithms on the synthetic data. All the results show that our algorithm converges significantly faster than the state-of-the-art methods in terms of function values and estimation error in all the settings. Although our SRGSP algorithm, SVRGHT and LSSG-HT have all been theoretically proved to have a linear convergence rate, SRGSP consistently outperforms SVRGHT and LSSG-HT in terms of the number of effective passes.

5.3. Real-World Data

In this subsection, we focus on two real-world large-scale datasets, rcv1-train and real-sim, which can be downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). In our experiments, we use the rcv1-train and real-sim datasets to evaluate performance on linear regression; the two datasets contain 20,242 samples with 47,236 features and 72,309 samples with 20,958 features, respectively. Moreover, we choose $s = 200$ for both datasets and compare all the algorithms in terms of the logarithm of the objective value gap versus the number of effective passes and the running time (in seconds).
Figure 2 illustrates the performance of our algorithm in terms of the logarithm of the function gap (i.e., $\log(F(\widehat{\mathbf{x}}^t) - F(\mathbf{x}^*))$). More specifically, our SRGSP algorithm has a faster convergence rate than the four state-of-the-art sparsity constrained algorithms. In addition, SRGSP has the ability to jump out of a local minimum and can find a better solution, as shown in Figure 2b. Compared with SVRGHT, the results of SRGSP in the first few iterations are similar to those of SVRGHT. However, thanks to many gradient updates followed by a single hard thresholding, SRGSP can reach a better solution, as discussed in [36]. This further verifies the advantage of SRGSP over the other methods. On the other hand, our SRGSP algorithm reaches better solutions in much less CPU time than the other methods, including SVRGHT and LSSG-HT. SRGSP performs one hard thresholding operation per epoch, while SVRGHT needs $n$ such operations ($n$ being the number of samples) in each epoch; thus, SVRGHT has a higher per-iteration complexity than SRGSP, especially for large-scale and high-dimensional data. Therefore, our SRGSP algorithm is very well suited to large-scale non-convex sparsity constrained problems.

5.4. Image Denoising

In this subsection, we apply our SRGSP algorithm to image denoising tasks to evaluate its performance. First of all, the most important ingredient of this image denoising pipeline is a suitable dictionary. The DCT is a reliable choice, following Guleryuz's work [45], and we use the overcomplete DCT dictionary in our experiments. Similar to [8], the error bound $\varepsilon$ is set empirically to $1.15\sigma$. The experiments are conducted on 6 standard benchmark images with synthetic white Gaussian noise. The sparsity level parameter in this experiment is set to $s = 10$, and Gaussian random noise with zero mean and standard deviation $\sigma$ is added to the standard images. The dictionary in the experiments is a fixed overcomplete DCT dictionary of size $64 \times 256$, designed to deal with image patches of $8 \times 8$ pixels. The denoising process mainly consists of the sparse coding of these patches using different sparsity constrained algorithms (e.g., GraSP, SG-HT, SVRGHT, LSSG-HT and SRGSP) and the classical greedy orthogonal matching pursuit (OMP) algorithm [46]. The sparse code $\mathbf{x}$ is computed until the loss in Equation (1) falls below the error bound or the number of iterations exceeds 32 (half the number of rows of the dictionary). When computing the sparse representation of overlapping patches, all the algorithms have to evaluate the sparse coding solution of 62,001 patches for images of size $256 \times 256$ or 255,025 patches for images of size $512 \times 512$. Then, the restored patches are averaged following the same procedure as in [8]. All the experiments are repeated 10 times, and the average results are reported.
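For orientation, here is a simplified NumPy sketch of this patch-wise denoising loop. Here `sparse_code` stands for any of the $\ell_0$-constrained solvers compared above and is an assumed placeholder; the overlapping-patch averaging follows the standard scheme of [8]:

```python
import numpy as np

def denoise_patches(noisy, D, sparse_code, patch=8, stride=1):
    """Denoise an image by sparse-coding every overlapping patch over D
    and averaging the overlapping reconstructions."""
    H, W = noisy.shape
    recon = np.zeros_like(noisy, dtype=float)
    weight = np.zeros_like(noisy, dtype=float)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            y = noisy[i:i + patch, j:j + patch].reshape(-1)    # 64-dim patch vector
            x = sparse_code(D, y)                               # s-sparse code of the patch
            recon[i:i + patch, j:j + patch] += (D @ x).reshape(patch, patch)
            weight[i:i + patch, j:j + patch] += 1.0
    return recon / np.maximum(weight, 1e-12)
```

With stride 1 and 8 x 8 patches, a 256 x 256 image yields (256 - 8 + 1)^2 = 62,001 patches and a 512 x 512 image yields 255,025 patches, matching the counts above.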
Table 1 shows the results (PSNR and SSIM) of all the algorithms at different noise levels, i.e., values of $\sigma$ ranging from 5 to 55. As we can see from Table 1, our SRGSP algorithm obtains higher PSNR and SSIM results than the other methods in all the settings, which indicates that the intrinsic low-dimensional structure can be found by our algorithm. Figure 3 shows the visual results of all the methods (i.e., SRGSP, LSSG-HT, SVRGHT, SG-HT, GraSP and OMP) on the cameraman image with $\sigma = 15$, where $s = 10$. It is clearly visible that the sky region of the cameraman image is well restored by SRGSP, while the other methods do not recover it as well. Moreover, our SRGSP algorithm attains a higher PSNR value of 29.08 dB, compared to 27.96 dB for SVRGHT and 27.32 dB for OMP. The SSIM results of OMP, SVRGHT and SRGSP are 0.6006, 0.8588 and 0.8761, respectively. All the above results demonstrate the effectiveness of SRGSP for image denoising tasks.
Figure 4, Figure 5 and Figure 6 show the denoising results of all the algorithms on the hill, peppers and boat images with different noise levels (e.g., $\sigma = 25$, 35, 45). We can see that our SRGSP algorithm consistently outperforms the other methods in terms of both PSNR and SSIM. Moreover, we offer the following empirical parameter-tuning suggestions for SRGSP. Based on all the experimental results, we find that in image denoising tasks the error bound $\varepsilon$ can be set empirically to $1.15\sigma$ to yield good results. As a general setting, the number of outer-iterations $T$ can be chosen in the interval [20, 30], and the number of inner-iterations set to an integer multiple of the number of samples.

5.5. Face Recognition

In this subsection, we evaluate the performance of our SRGSP algorithm for face recognition on two real-world face datasets. More specifically, SRGSP is used as the solver in the sparse representation-based classification (SRC) framework [12]. We compare the recognition rates of SRGSP with those of some state-of-the-art sparsity constrained algorithms, such as GraSP, SG-HT, SVRGHT and LSSG-HT. In order to evaluate the robustness of our algorithm, we manually add Gaussian noise to the face data.

5.5.1. Datasets

Although there are many datasets available for face recognition, we choose two commonly used public datasets, namely the AR database [47] and the extended Yale B database [48]. The extended Yale B database contains 2414 frontal-face images of 38 people under different controlled lighting conditions [48]. For each individual, we randomly choose 26 images for training and 15 images for testing. The AR database contains over 4000 color images corresponding to the faces of 126 people (70 male and 56 female). The images were obtained under different conditions, including different facial expressions, illumination conditions, and occlusions such as scarves or sunglasses. For simplicity, we randomly choose 100 subjects, each with 15 images for training and 11 for testing. Note that the AR database may be difficult for face recognition because there are more classes to recognize and few training samples per subject.

5.5.2. Experimental Setup

For each sparsity constrained algorithm, the sparsity level makes a great difference to the sparse representation solution, especially at different noise levels. Therefore, in order to approach the best performance of all these algorithms, we vary the sparsity parameter within a certain range for each algorithm, i.e., $s \in \{5, 10, 20, 30, 40, 50\}$. Thus, we can make sure that all these algorithms achieve their best recognition rates within this parameter range. In all experimental settings, the images are down-sampled to $32 \times 32$ pixels. As in [12], a series of processing operations are applied to the above datasets. We first rescale the training matrix into $[0, 1]$ for the convenience of adding noise, and then add Gaussian noise with zero mean and standard deviation $\sigma$. Finally, we normalize the columns of the training matrix to have unit $\ell_2$-norm.
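A minimal NumPy sketch of this preprocessing chain is given below. The function name is ours, the down-sampling is implemented by simple index subsampling (an assumption; any standard image-resizing routine could be used instead), and the rescaling to [0, 1] is done globally over the training matrix:

```python
import numpy as np

def preprocess_faces(images, sigma, size=(32, 32), seed=0):
    """Down-sample, rescale to [0, 1], add Gaussian noise, and l2-normalize columns.

    images: array of shape (num_samples, H, W). Returns the (d x num_samples)
    training matrix with unit-norm columns, where d = 32 * 32."""
    rng = np.random.default_rng(seed)
    cols = []
    for img in images:
        h_idx = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)   # simple down-sampling
        w_idx = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
        cols.append(img[np.ix_(h_idx, w_idx)].astype(float).reshape(-1))
    A = np.stack(cols, axis=1)
    A = (A - A.min()) / (A.max() - A.min() + 1e-12)                      # rescale to [0, 1]
    A = A + rng.normal(0.0, sigma, size=A.shape)                         # add Gaussian noise
    return A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)        # unit l2-norm columns
```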

5.5.3. Results on Real-World Face Data

In this part, we report the recognition rates of SRGSP on both the AR and extended Yale B databases. Figure 7 shows real testing images from the AR database with different Gaussian noise levels (e.g., $\sigma = 0.25$ and $\sigma = 0.5$). As we can see, it is challenging even for humans to correctly recognize the face images in this situation. However, even under these extreme conditions, SRGSP achieves a high recognition rate with high probability.
In Figure 8, SVRGHT, LSSG-HT and SRGSP obtain much higher recognition rates than the other two methods on both datasets, which verifies the superiority of variance reduction and recursive gradient techniques. Although in the very early iterations SRGSP may have slightly lower recognition rates than SVRGHT, as the number of iterations increases SRGSP achieves the highest recognition rate among all the classifiers. Moreover, SRGSP is several times faster than SVRGHT due to its fewer hard thresholding operations. For example, when $\sigma = 0.25$ on the Yale B database, SRGSP obtains over 90% recognition rate within 50 s, while SVRGHT needs more CPU time to reach the same accuracy. In fact, for the same number of passes, SRGSP still achieves the highest recognition rate among the state-of-the-art hard thresholding algorithms, which demonstrates the effectiveness and efficiency of SRGSP. In the extreme situation of the AR database with Gaussian noise level $\sigma = 0.5$, SRGSP still achieves a higher recognition rate than the other methods. The recognition rates with respect to the number of passes on both datasets are provided in Figure 9 and Figure 10, and further demonstrate the superiority of SRGSP.

6. Conclusions and Future Work

In this paper, we proposed a stochastic recursive gradient support pursuit (SRGSP) method for solving large-scale sparsity constrained optimization problems. We also provided a convergence analysis of SRGSP, which shows that SRGSP attains a linear convergence rate. Whereas existing hard thresholding-based algorithms need many more thresholding operations, SRGSP needs only one hard thresholding operation per epoch, and thus has a significantly lower average per-iteration computational complexity, i.e., $O(d)$ vs. $O(d\log(d))$. Experimental results on synthetic and real large-scale datasets verified the effectiveness and efficiency of SRGSP.
Moreover, we also applied our SRGSP method to tackle image denoising and face recognition tasks, where sparse representation learning plays an important role. Our experimental results show that SRGSP outperforms other sparse representation methods in terms of PSNR and recognition rates. Note that for the image denoising application, the dictionary is the fixed overcomplete DCT dictionary. Inspired by some sophisticated methods such as K-SVD [38], we will iteratively update the dictionary in the future, which can further improve performance. In fact, there are many real-world sparse representation learning applications such as image super-resolution, image restoration, image classification and visual tracking. Therefore, we will apply our SRGSP method to address more applications in the future. In addition, our SRGSP algorithm can be extended to tackle low-rank matrix and tensor completion and recovery problems as in [49,50,51].

Author Contributions

Methodology and Formal analysis, F.S.; Software and Formal analysis, B.W.; Formal analysis, Y.L.; Investigation, H.L.; Visualization, S.W.; Review & editing, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61876220, 61876221, 61976164, 61836009, U1701267 and 61871310), the Project Supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the Science Foundation of Xidian University (Nos. 10251180018 and 10251180019), the National Science Basic Research Plan in Shaanxi Province of China (Nos. 2019JQ-657 and 2020JM-194), and the Key Special Project of China High Resolution Earth Observation System-Young Scholar Innovation Fund.

Acknowledgments

We thank all the reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this paper, we use $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n] \in \mathbb{R}^{d\times n}$ to denote the design matrix, $\mathbf{y} = [y_1, y_2, \ldots, y_d]^T \in \mathbb{R}^d$ to denote the response vector, and $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$ to denote the model parameter. For the parameter vector $\mathbf{x}\in\mathbb{R}^n$, $\|\mathbf{x}\|_2$ denotes its $\ell_2$-norm, while $\|\mathbf{x}\|_1$ is its $\ell_1$-norm.

Sparse Representation-Based Classification (SRC)

In this paper, we apply our algorithm into the SRC framework for face recognition tasks, as outlined in Algorithm A1.
Algorithm A1: Sparse Representation-based Classification
Input: A matrix of training samples $\mathbf{A} = [\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_k] \in \mathbb{R}^{d\times n}$ for $k$ classes, sparsity parameter $s$, and a test sample $\mathbf{y} \in \mathbb{R}^d$.
1: Normalize the columns of $\mathbf{A}$ to have unit $\ell_2$-norm;
2: Solve the $\ell_0$-minimization problem:
$$\widehat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2, \quad \mathrm{s.t.},\ \|\mathbf{x}\|_0 \le s;$$
3: Compute the residuals $r_i(\mathbf{y}) = \|\mathbf{y} - \mathbf{A}\,\delta_i(\widehat{\mathbf{x}})\|_2$ for $i = 1, \ldots, k$;
Output: $\mathrm{identity}(\mathbf{y}) = \arg\min_i r_i(\mathbf{y})$.

References

  1. Zhang, Z.; Xu, Y.; Yang, J.; Li, X.; Zhang, D. A survey of sparse representation: Algorithms and applications. IEEE Access 2015, 3, 490–530. [Google Scholar] [CrossRef]
  2. Liu, S.; Hu, Q.; Li, P.; Zhao, J.; Wang, C.; Zhu, Z. Speckle Suppression Based on Sparse Representation with Non-Local Priors. Remote Sens. 2018, 10, 439. [Google Scholar] [CrossRef]
  3. Tu, B.; Zhang, X.; Kang, X.; Zhang, G.; Wang, J.; Wu, J. Hyperspectral Image Classification via Fusing Correlation Coefficient and Joint Sparse Representation. IEEE Geosci. Remote Sens. Lett. 2018, 15, 340–344. [Google Scholar] [CrossRef]
  4. Liu, S.; Liu, M.; Li, P.; Zhao, J.; Zhu, Z.; Wang, X. SAR Image Denoising via Sparse Representation in Shearlet Domain Based on Continuous Cycle Spinning. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2985–2992. [Google Scholar] [CrossRef]
  5. Shao, L.; Yan, R.; Li, X.; Liu, Y. From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms. IEEE Trans. Cybern. 2014, 44, 1001–1013. [Google Scholar] [CrossRef]
  6. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K.O. Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095. [Google Scholar] [CrossRef]
  7. Yan, R.; Shao, L.; Cvetkovic, S.D.; Klijn, J. Improved nonlocal means based on pre-classification and invariant block matching. J. Disp. Technol. 2012, 8, 212–218. [Google Scholar] [CrossRef]
  8. Elad, M.; Aharon, M. Image denoising via learned dictionaries and sparse representation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 895–900. [Google Scholar]
  9. Mairal, J.; Bach, F.; Ponce, J.; Sapiro, G.; Zisserman, A. Non-local sparse models for image restoration. In Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, 29 September–2 October 2009; pp. 2272–2279. [Google Scholar]
  10. Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [Google Scholar] [CrossRef]
  11. Belhumeur, P.N.; Hespanha, J.P.; Kriegman, D.J. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 711–720. [Google Scholar] [CrossRef] [Green Version]
  12. Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Ma, Y. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 210–227. [Google Scholar] [CrossRef] [Green Version]
  13. Candes, E.J.; Wakin, M.B.; Boyd, S.P. Enhancing sparsity by reweighted L1 minimization. J. Fourier Anal. Appl. 2008, 14, 877–905. [Google Scholar] [CrossRef]
  14. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  15. Mallat, S.G.; Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 1993, 41, 3397–3415. [Google Scholar] [CrossRef] [Green Version]
  16. Pati, Y.C.; Rezaiifar, R.; Krishnaprasad, P.S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 1–3 November 1993; pp. 40–44. [Google Scholar]
  17. Needell, D.; Vershynin, R. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. arXiv 2007, arXiv:0712.1360. [Google Scholar] [CrossRef] [Green Version]
  18. Dai, W.; Milenkovic, O. Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory 2009, 55, 2230–2249. [Google Scholar] [CrossRef] [Green Version]
  19. Needell, D.; Tropp, J.A. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 2010, 53, 93–100. [Google Scholar] [CrossRef]
  20. Blumensath, T.; Davies, M. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 2009, 27, 265–274. [Google Scholar] [CrossRef] [Green Version]
  21. Yuan, X.; Li, P.; Zhang, T. Gradient Hard Thresholding Pursuit for Sparsity-Constrained Optimization. Available online: http://proceedings.mlr.press/v32/yuan14.pdf (accessed on 29 August 2020).
  22. Bahmani, S.; Raj, B.; Boufounos, P.T. Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 2013, 14, 807–841. [Google Scholar]
  23. Nguyen, N.; Needell, D.; Woolf, T. Linear Convergence of Stochastic Iterative Greedy Algorithms with Sparse Constraints. IEEE Trans. Inf. Theory 2017, 63, 6869–6895. [Google Scholar] [CrossRef]
  24. Li, X.; Zhao, T.; Arora, R.; Liu, H.; Haupt, J. Stochastic Variance Reduced Optimization for Nonconvex Sparse Learning. Available online: http://proceedings.mlr.press/v48/lid16.pdf (accessed on 29 August 2020).
  25. Johnson, R.; Zhang, T. Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction. Available online: http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction (accessed on 29 August 2020).
  26. Shen, J.; Li, P. A tight bound of hard thresholding. J. Mach. Learn. Res. 2017, 18, 7650–7691. [Google Scholar]
  27. Chen, J.; Gu, Q. Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, New York, NY, USA, 25–29 June 2016. [Google Scholar]
  28. Gao, H.; Huang, H. Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 13–19 July 2018; pp. 2128–2134. [Google Scholar]
  29. Chen, J.; Gu, Q. Fast newton hard thresholding pursuit for sparsity constrained nonconvex optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 15–17 August 2017; ACM: New York, NY, USA, 2017; pp. 757–766. [Google Scholar]
  30. Shang, F.; Liu, Y.; Cheng, J.; Zhuo, J. Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. arXiv 2017, arXiv:1703.07948. [Google Scholar]
  31. Liang, G.; Tong, Q.; Zhu, C.; Bi, J. An Effective Hard Thresholding Method Based on Stochastic Variance Reduction for Nonconvex Sparse Learning. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 1585–1592. [Google Scholar]
  32. Shang, F.; Zhou, K.; Liu, H.; Cheng, J.; Tsang, I.; Zhang, L.; Tao, D.; Jiao, L. VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning. IEEE Trans. Knowl. Data Eng. 2020, 32, 188–202. [Google Scholar] [CrossRef] [Green Version]
  33. Liu, Y.; Shang, F.; Liu, H.; Kong, L.; Jiao, L.; Lin, Z. Accelerated Variance Reduction Stochastic ADMM for Large-Scale Machine Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef]
  34. Liu, X.; Wei, B.; Shang, F.; Liu, H. Loopless Semi-Stochastic Gradient Descent with Less Hard Thresholding for Sparse Learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; ACM: New York, NY, USA, 2019; pp. 881–890. [Google Scholar]
  35. Nguyen, L.M.; Liu, J.; Scheinberg, K.; Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2613–2621. [Google Scholar]
  36. Zhao, Y.B. Optimal k-thresholding algorithms for sparse optimization problems. SIAM J. Optim. 2020, 30, 31–55. [Google Scholar] [CrossRef] [Green Version]
  37. Engan, K.; Rao, B.D.; Kreutz-Delgado, K. Frame design using FOCUSS with method of optimal directions (MOD). In Proceedings of the NORSIG, Oslo, Norway, 9–11 September 1999; pp. 65–69. [Google Scholar]
  38. Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006, 54, 4311–4322. [Google Scholar] [CrossRef]
  39. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
  40. Zhou, K.; Shang, F.; Cheng, J. A Simple Stochastic Variance Reduced Algorithm with Fast Convergence Rates. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5975–5984. [Google Scholar]
  41. Shang, F.; Jiao, L.; Zhou, K.; Cheng, J.; Ren, Y.; Jin, Y. ASVRG: Accelerated Proximal SVRG. In Proceedings of the Asian Conference on Machine Learning, Beijing, China, 14–16 November 2018; pp. 815–830. [Google Scholar]
  42. Yuan, H.; Lian, X.; Li, C.J.; Liu, J.; Hu, W. Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent. In Advances in Neural Information Processing Systems; NIPS: Vancouver, BC, Canada, 2019; pp. 6926–6935. [Google Scholar]
  43. Zhou, P.; Yuan, X.T.; Yan, S.; Feng, J. Faster First-Order Methods for Stochastic Non-Convex Optimization on Riemannian Manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 2019. [Google Scholar] [CrossRef] [Green Version]
  44. Karimi, H.; Nutini, J.; Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2016; pp. 795–811. [Google Scholar]
  45. Guleryuz, O.G. Nonlinear approximation based image recovery using adaptive sparse reconstructions. In Proceedings of the 2003 International Conference on Image Processing, Barcelona, Spain, 14–17 September 2003; pp. 713–716. [Google Scholar]
  46. Tropp, J.A.; Gilbert, A.C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory 2007, 53, 4655–4666. [Google Scholar] [CrossRef] [Green Version]
  47. Martinez, A.M. The AR face database. Available online: https://ci.nii.ac.jp/naid/10011462458/ (accessed on 29 August 2020).
  48. Georghiades, A.S.; Belhumeur, P.N.; Kriegman, D.J. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 643–660. [Google Scholar] [CrossRef] [Green Version]
  49. Shang, F.; Cheng, J.; Liu, Y.; Luo, Z.Q.; Lin, Z. Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2066–2080. [Google Scholar] [CrossRef] [Green Version]
  50. Liu, Y.; Shang, F.; Fan, W.; Cheng, J.; Cheng, H. Generalized Higher-Order Orthogonal Iteration for Tensor Decomposition and Completion. Available online: http://papers.nips.cc/paper/5476-generalized-higher-order-orthogonal-iteration-for-tensor-decomposition-and-completion (accessed on 29 August 2020).
  51. Liu, Y.; Shang, F.; Fan, W.; Cheng, J.; Cheng, H. Generalized Higher Order Orthogonal Iteration for Tensor Learning and Decomposition. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 2551–2563. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Comparison of gradient support pursuit (GraSP) [22], stochastic gradient hard thresholding (SG-HT) [23], stochastic variance reduced gradient with hard thresholding (SVRGHT) [24], loopless semi-stochastic gradient descent with less hard thresholding (LSSG-HT) [34] and SRGSP for solving sparse linear regression problems on synthetic data. (a) n = 2500, d = 5000, s* = 250; (b) n = 5000, d = 10,000, s* = 500.
Figure 2. Comparison of GraSP [22], SG-HT [23], SVRGHT [24], LSSG-HT [34] and our SRGSP method for solving sparse linear regression problems. In each plot, the vertical axis shows the logarithm of the objective value minus the minimum, and the horizontal axis is the number of effective passes over data or running time (in seconds). (a) rcv1; (b) real-sim.
Figure 3. Comparison of the denoising results (PSNR/SSIM) of all the algorithms on the standard cameraman image with σ = 15, where s = 10. (a) Noise image (σ = 15); (b) OMP (27.62/0.6006); (c) SG-HT (22.90/0.7308); (d) GraSP (23.50/0.8453); (e) SVRGHT (27.96/0.8588); (f) LSSG-HT (27.26/0.8433); (g) SRGSP (29.08/0.8761).
Figure 4. Comparison of the denoising results (PSNR/SSIM) of all the algorithms on the standard hill image with σ = 25, where s = 10. (a) Original image; (b) noise image (σ = 25); (c) SG-HT (25.58/0.5711); (d) OMP (21.90/0.4065); (e) GraSP (27.01/0.6748); (f) SVRGHT (28.16/0.7012); (g) LSSG-HT (27.86/0.7095); (h) SRGSP (28.53/0.7164).
Figure 5. Comparison of the denoising results (PSNR/SSIM) of all the algorithms on the standard peppers image with σ = 35, where s = 10. (a) Original image; (b) noise image (σ = 35); (c) SG-HT (21.65/0.5514); (d) OMP (20.48/0.3529); (e) GraSP (23.91/0.7546); (f) SVRGHT (25.88/0.7390); (g) LSSG-HT (25.12/0.7023); (h) SRGSP (26.56/0.7851).
Figure 6. Comparison of the denoising results (PSNR/SSIM) of all the algorithms on the standard boat image with σ = 45, where s = 10. (a) Original image; (b) noise image (σ = 45); (c) SG-HT (22.62/0.4135); (d) OMP (16.77/0.2176); (e) GraSP (24.42/0.6146); (f) SVRGHT (24.68/0.5172); (g) LSSG-HT (25.51/0.5321); (h) SRGSP (25.68/0.6519).
Figure 7. Examples of the AR face database with different levels of white Gaussian noise. (a) σ = 0; (b) σ = 0.25; (c) σ = 0.5.
Figure 8. Recognition rates of all the algorithms on the extended Yale B (top) and AR (bottom) databases with different levels of Gaussian noise: σ = 0.25 and σ = 0.5. (a) The Yale B database with σ = 0.25 (left) and σ = 0.5 (right); (b) the AR database with σ = 0.25 (left) and σ = 0.5 (right).
Figure 9. Comparison of recognition rates of all the algorithms on the extended Yale B and AR databases with the Gaussian noise level σ = 0.25. (a) The Yale B dataset; (b) the AR dataset.
Figure 10. Comparison of recognition rates of all the algorithms on the extended Yale B and AR databases with the Gaussian noise level σ = 0.5. (a) The Yale B dataset; (b) the AR dataset.
Table 1. The denoising results (PSNR (dB)/SSIM) of all the methods including OMP [46], GraSP [22], SG-HT [23], SVRGHT [24], LSSG-HT [34] and our SRGSP method on 6 standard images at different noise levels from 5 to 55. The best performance is shown in bold.
σ | Algorithm | Peppers | Cameraman | House | Man | Hill | Boat
5 | SRGSP | 34.45/0.9259 | 34.12/0.9195 | 35.64/0.9322 | 34.15/0.9149 | 34.25/0.9132 | 34.68/0.9256
5 | LSSG-HT | 33.56/0.9022 | 32.65/0.8932 | 34.12/0.9123 | 33.26/0.8865 | 33.35/0.8995 | 33.32/0.9203
5 | SVRGHT | 33.95/0.9135 | 33.25/0.9065 | 34.01/0.9023 | 33.95/0.8997 | 33.39/0.9012 | 33.61/0.9165
5 | SG-HT | 25.56/0.7801 | 26.62/0.7725 | 27.56/0.7835 | 27.65/0.7832 | 27.85/0.7532 | 25.89/0.7710
5 | GraSP | 27.89/0.8832 | 24.72/0.8632 | 30.85/0.8857 | 28.25/0.7755 | 28.65/0.8278 | 26.91/0.7897
5 | OMP | 34.02/0.8932 | 33.25/0.8706 | 34.23/0.8769 | 32.35/0.8562 | 33.35/0.8623 | 33.73/0.9143
10 | SRGSP | 32.64/0.8942 | 32.05/0.8867 | 33.95/0.8932 | 32.13/0.8549 | 32.85/0.8535 | 32.56/0.8691
10 | LSSG-HT | 31.68/0.8822 | 30.22/0.7823 | 33.15/0.8734 | 31.56/0.8305 | 31.56/0.8462 | 31.98/0.8565
10 | SVRGHT | 31.95/0.8835 | 29.56/0.7656 | 33.05/0.8721 | 31.35/0.8497 | 31.12/0.8342 | 31.56/0.8479
10 | SG-HT | 24.48/0.7532 | 24.98/0.7272 | 26.85/0.7373 | 26.95/0.7326 | 26.89/0.7265 | 25.56/0.7235
10 | GraSP | 27.56/0.8596 | 24.25/0.8323 | 30.15/0.8657 | 27.65/0.7651 | 28.26/0.7578 | 26.54/0.7589
10 | OMP | 30.25/0.7656 | 29.56/0.7685 | 29.52/0.7685 | 28.36/0.7265 | 29.65/0.7552 | 29.89/0.7551
15 | SRGSP | 30.57/0.8759 | 29.10/0.8707 | 32.81/0.8622 | 30.34/0.8249 | 30.65/0.7990 | 30.53/0.8191
15 | LSSG-HT | 29.95/0.8432 | 27.26/0.7102 | 32.12/0.8234 | 29.96/0.8105 | 30.24/0.7895 | 30.35/0.7956
15 | SVRGHT | 30.35/0.8585 | 27.84/0.6927 | 32.26/0.8513 | 30.09/0.8197 | 30.34/0.7897 | 30.34/0.8079
15 | SG-HT | 23.07/0.7311 | 22.92/0.6872 | 25.53/0.7073 | 26.12/0.7066 | 26.57/0.6721 | 24.89/0.6635
15 | GraSP | 27.04/0.8449 | 23.46/0.8086 | 29.65/0.8357 | 26.97/0.7455 | 27.87/0.7178 | 26.11/0.7207
15 | OMP | 27.76/0.6602 | 27.32/0.6006 | 27.80/0.5769 | 27.69/0.6504 | 27.70/0.6592 | 26.37/0.6051
25 | SRGSP | 28.19/0.8232 | 27.37/0.8186 | 30.39/0.8224 | 28.10/0.7470 | 28.53/0.7164 | 28.18/0.7445
25 | LSSG-HT | 27.35/0.7785 | 26.35/0.5121 | 29.56/0.7806 | 27.85/0.7531 | 27.86/0.6842 | 27.96/0.7095
25 | SVRGHT | 27.85/0.7606 | 26.85/0.5232 | 29.48/0.7797 | 27.92/0.7323 | 28.09/0.6998 | 27.82/0.7012
25 | SG-HT | 22.34/0.6386 | 22.40/0.5810 | 24.68/0.5974 | 25.15/0.6030 | 25.57/0.5757 | 24.13/0.5711
25 | GraSP | 26.12/0.8078 | 24.66/0.7715 | 28.50/0.8034 | 27.09/0.7188 | 27.00/0.6617 | 25.54/0.6748
25 | OMP | 23.40/0.4696 | 23.23/0.4304 | 23.35/0.3838 | 23.30/0.4458 | 23.31/0.4457 | 21.91/0.4065
35 | SRGSP | 26.55/0.7851 | 25.84/0.7674 | 28.62/0.7909 | 26.84/0.6966 | 27.29/0.6642 | 26.74/0.6920
35 | LSSG-HT | 25.16/0.7023 | 25.32/0.6812 | 27.65/0.7126 | 26.48/0.6610 | 26.56/0.6215 | 26.25/0.6126
35 | SVRGHT | 25.94/0.7390 | 25.46/0.6907 | 27.46/0.7046 | 26.52/0.6627 | 26.85/0.6371 | 26.05/0.6025
35 | SG-HT | 21.68/0.5514 | 21.77/0.4840 | 23.78/0.4916 | 24.28/0.5159 | 24.68/0.4950 | 23.37/0.4870
35 | GraSP | 23.96/0.7546 | 22.91/0.7080 | 26.33/0.7549 | 25.67/0.6639 | 26.29/0.6269 | 25.02/0.6442
35 | OMP | 20.46/0.3529 | 20.24/0.3264 | 20.40/0.2780 | 20.42/0.3215 | 20.38/0.3121 | 18.94/0.2882
45 | SRGSP | 25.26/0.7501 | 24.76/0.7258 | 27.53/0.7601 | 25.83/0.6575 | 26.40/0.6247 | 25.71/0.6519
45 | LSSG-HT | 24.05/0.6502 | 24.12/0.6321 | 26.64/0.6532 | 25.32/0.5962 | 25.62/0.5933 | 25.51/0.5321
45 | SVRGHT | 24.44/0.6722 | 24.24/0.6151 | 26.28/0.6375 | 25.46/0.6009 | 25.95/0.5807 | 24.71/0.5172
45 | SG-HT | 21.05/0.4754 | 21.24/0.457 | 23.04/0.4120 | 23.38/0.4386 | 23.80/0.4207 | 22.63/0.4135
45 | GraSP | 23.30/0.7264 | 22.70/0.6848 | 25.76/0.7330 | 25.03/0.6355 | 25.71/0.6014 | 24.43/0.6146
45 | OMP | 18.27/0.2724 | 18.20/0.2604 | 18.24/0.2134 | 18.20/0.2384 | 18.18/0.2261 | 16.77/0.2176
55 | SRGSP | 24.28/0.7254 | 23.90/0.6946 | 26.28/0.7203 | 25.16/0.6271 | 25.74/0.5988 | 24.84/0.6179
55 | LSSG-HT | 23.35/0.6212 | 23.01/0.5621 | 25.43/0.5963 | 24.52/0.5632 | 25.01/0.5423 | 24.32/0.4623
55 | SVRGHT | 23.62/0.6183 | 23.22/0.5519 | 25.23/0.5708 | 24.74/0.5480 | 25.27/0.5349 | 23.61/0.4462
55 | SG-HT | 20.51/0.4127 | 20.60/0.3484 | 22.25/0.3449 | 22.59/0.3774 | 22.98/0.3635 | 21.91/0.3546
55 | GraSP | 22.86/0.7050 | 22.36/0.6646 | 25.04/0.7050 | 24.57/0.6136 | 25.30/0.5854 | 23.94/0.5918
55 | OMP | 16.50/0.2203 | 16.43/0.2174 | 16.44/0.1665 | 16.48/0.1852 | 16.47/0.1720 | 15.02/0.1667
