#### *5.2. Sparsification of Natural Images*

A classic sparsifying transform learning model [21] is formulated as

$$\min\_{\mathbf{W}, \mathbf{Y}} \|\mathbf{W}\mathbf{X} - \mathbf{Y}\|\_F^2 - \lambda \log \det \mathbf{W} + \mu \|\mathbf{W}\|\_F^2$$
 
$$\text{s.t. } \|\mathbf{Y}\_i\|\_0 \le s \,\forall \, i,\tag{28}$$

where **X** is the training data, **Y** is the matrix of sparse coefficients, and **W** is the learnt transform. In [21], the quality of the learnt transforms was judged by their condition number and sparsification error. Following that experimental setting, we also evaluated the transforms learnt by our DRTPF using these two criteria. The $\ell\_2$-norm condition number of a transform operator **Φ** is defined as the ratio of the maximum singular value to the minimum singular value of **Φ**; that is,

$$
\mathcal{K}\_{\Phi} = \frac{\delta\_{\max}(\Phi)}{\delta\_{\min}(\Phi)}.\tag{29}
$$
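As a concrete check, the following minimal numpy sketch (our illustration, not part of the experiments in [21]) computes this condition number via the SVD and verifies that a Parseval frame, for which $\Phi\Phi^T = \mathbf{I}$, has condition number exactly 1; the QR-based construction of the frame is an illustrative choice.

```python
import numpy as np

def condition_number(Phi: np.ndarray) -> float:
    """l2-norm condition number (Eq. 29): ratio of the largest to the
    smallest singular value of the transform operator Phi."""
    s = np.linalg.svd(Phi, compute_uv=False)
    return s.max() / s.min()

# A 100 x 200 Parseval frame: Phi @ Phi.T = I, so every singular value
# equals 1 and the condition number is exactly 1.
Q, _ = np.linalg.qr(np.random.randn(200, 100))  # Q has orthonormal columns
Phi = Q.T                                       # rows form a tight frame
print(np.allclose(Phi @ Phi.T, np.eye(100)))    # True
print(condition_number(Phi))                    # ~1.0
```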

In our case, the condition number $\mathcal{K}\_{\Phi} = 1$, as the maximum and minimum singular values (which are determined by the optimal frame bounds) are both equal to 1. Similarly, $\mathcal{K}\_{\Psi} = 1$. A condition number equal to 1 is the best case for a transform operator. The sparsification error of the model (28) is defined as

$$SE = \|\mathbf{WX} - \mathbf{Y}\|\_F^2. \tag{30}$$

Similarly, we define the 'sparsification error' of the proposed DRTPF, to measure the energy loss due to sparse representation, which is formulated as

$$\widehat{SE} = \|\mathbf{Y} - \mathcal{S}\_{\lambda}(\mathbf{\Psi}^T\mathbf{X})\|\_F^2. \tag{31}$$

The 'sparsification error' measures the compacting ability of the transform **Ψ**, reasonably ignoring the effect of the thresholding operator $\mathcal{S}\_{\lambda}(\cdot)$.
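For clarity, the two error measures in Equations (30) and (31), together with the recovery loss used later in this section, can be sketched in numpy as follows; the hard-thresholding form of $\mathcal{S}\_{\lambda}(\cdot)$ is an assumption on our part, chosen to be consistent with the $\ell\_0$ thresholding used in the experiments.

```python
import numpy as np

def hard_threshold(A: np.ndarray, lam: float) -> np.ndarray:
    """Assumed form of the operator S_lambda(.): keep entries whose
    magnitude exceeds lam, zero out the rest."""
    return np.where(np.abs(A) > lam, A, 0.0)

def sparsification_error(W, X, Y):
    """SE of the transform model [21] (Eq. 30): ||W X - Y||_F^2."""
    return np.linalg.norm(W @ X - Y, 'fro') ** 2

def drtpf_sparsification_error(Psi, X, Y, lam):
    """'Sparsification error' of DRTPF (Eq. 31): ||Y - S_lam(Psi^T X)||_F^2."""
    return np.linalg.norm(Y - hard_threshold(Psi.T @ X, lam), 'fro') ** 2

def recovery_loss(Phi, X, Y):
    """Recovery loss ||X - Phi Y||_F^2, the other part of the DRTPF loss."""
    return np.linalg.norm(X - Phi @ Y, 'fro') ** 2
```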

**Figure 4.** The test images 'Barbara', 'Lena', 'Hill', 'Couple', 'Boat', and 'Man'.

To demonstrate that our model and algorithms are insensitive to the initialization of the transforms, we applied the proposed sparse coding and transform-operator-pair learning algorithms to train a pair of transforms. The training data are patches of size 10 × 10 extracted from the image 'Barbara' shown in Figure 4. The trained transforms are of size 100 × 200. We extracted the patches without overlap and removed the DC value of every sample. We set the parameters $\eta\_1 = 1.1$ and $\eta\_3 = 10^7$, and $\eta\_2$ was replaced by the $\ell\_0$ thresholding $0.6\sigma$, as before. The matrices used for initialization were the 1D DCT matrix, a matrix with random columns sampled from the training data, and the redundant identity matrix. As the transform for DRTPF is redundant, the redundant identity matrix here is formed as [**I I**], where **I** is the identity matrix of size 100 × 100.
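A minimal sketch of this data preparation and of the three initializations follows; the particular overcomplete-DCT construction is one common variant and is our assumption, not the paper's exact recipe.

```python
import numpy as np

def extract_patches(img: np.ndarray, p: int = 10) -> np.ndarray:
    """Non-overlapping p x p patches as columns, with the DC value
    (per-sample mean) removed."""
    h, w = img.shape
    cols = [img[i:i + p, j:j + p].reshape(-1)
            for i in range(0, h - p + 1, p)
            for j in range(0, w - p + 1, p)]
    X = np.stack(cols, axis=1).astype(float)
    return X - X.mean(axis=0, keepdims=True)

def overcomplete_dct(n: int = 100, m: int = 200) -> np.ndarray:
    """One common n x m overcomplete 1D DCT construction (assumed variant)."""
    D = np.cos(np.pi / m * np.outer(np.arange(n) + 0.5, np.arange(m)))
    return D / np.linalg.norm(D, axis=0)

def initializations(X: np.ndarray, n: int = 100, m: int = 200, seed: int = 0):
    """The three 100 x 200 initializations used in the experiment."""
    rng = np.random.default_rng(seed)
    rand_cols = X[:, rng.choice(X.shape[1], size=m, replace=False)]
    red_identity = np.hstack([np.eye(n), np.eye(n)])  # [I I]
    return overcomplete_dct(n, m), rand_cols, red_identity
```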

The convergence curves of the objective function and the 'sparsification error' are shown in Figure 5. From the left sub-figure of Figure 5, we see that our proposed algorithm for DRTPF converges, and that all initializations converge to the same result after about 20 iterations, which demonstrates that the proposed DRTPF and the corresponding algorithm are insensitive to the initialization. The right sub-figure of Figure 5 shows the 'sparsification error' for the three initializations, the 2D DCT transform, and the KLT transform. The 2D DCT is formed by the Kronecker product of two 1D DCT transforms, i.e., $\mathbf{D} = \mathbf{D}\_0 \otimes \mathbf{D}\_0$, where $\mathbf{D}\_0$ is the 1D DCT transform of size 8 × 8 and ⊗ denotes the Kronecker product. The KLT transform **K** of size 64 × 64 is obtained by the principal component analysis (PCA) method. The 'sparsification errors' of the 2D DCT and the KLT are calculated via the model in [21] at iteration zero. The figure shows that the 'sparsification error' of the proposed DRTPF model also converges and is insensitive to the initialization matrices. In fact, the loss function of the proposed DRTPF mainly contains two parts: $\|\mathbf{X} - \mathbf{\Phi}\mathbf{Y}\|\_F^2$ and $\|\mathbf{Y} - \mathcal{S}\_{\lambda}(\mathbf{\Psi}^T\mathbf{X})\|\_F^2$. The first part is the recovery loss (i.e., the loss in the temporal domain) and the second part is the 'sparsification error' (i.e., the loss in the frequency domain). Our proposed model aims to achieve low error in both the temporal and frequency domains.
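The two fixed baselines can be reproduced as follows; this is a minimal sketch in which, as stated above, the KLT is computed by PCA on the training patches.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal 1D DCT-II matrix D0 of size n x n."""
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

D0 = dct_matrix(8)
D = np.kron(D0, D0)  # 2D DCT of size 64 x 64: D = D0 (x) D0

def klt(X: np.ndarray) -> np.ndarray:
    """64 x 64 KLT via PCA: eigenvectors of the sample covariance of the
    training patches X (64 x N), sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=1, keepdims=True)
    C = Xc @ Xc.T / Xc.shape[1]
    w, V = np.linalg.eigh(C)          # ascending eigenvalues
    return V[:, ::-1].T               # rows are principal directions
```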

To illustrate the behavior of the proposed DRTPF in image representation, we chose the six images shown in Figure 4 to train transforms and recover images. Figure 6 shows the average sparsity curve and the recovery PSNR values as the sparsity increases. From the left sub-figure, we see that the images are well sparsified along the iterative process. This figure is generated by setting $\|\mathbf{y}\_i\|\_0 < 5$, and the recovery PSNR is 32.27 dB. For each sample $\mathbf{x}\_i$, vectorized from a 10 × 10 patch, the corresponding sparse coefficient vector $\mathbf{y}\_i$ is of length 200, so the sparsity rate is lower than 2.5%. Furthermore, less than 5% of the data need to be stored to recover an image with PSNR larger than 32.27 dB. The right sub-figure of Figure 6 shows the average recovery PSNR values as the sparsity increases, which is a main measure of the quality of the learnt transform. From the figure, we see that in most cases, our proposed DRTPF obtains better image quality in terms of PSNR at lower sparsity than the compared LST [21] method and the classic DCT transform. The transforms for the LST [21] method and the classic DCT are of size 64 × 64. The transform of LST [21] is trained on 4096 samples of size 8 × 8 extracted from every image shown in Figure 4, with the mean of the patches removed. The experimental settings follow those in [21]. When the total sparsity of a 512 × 512 image is more than 47,000, the recovery results of the proposed DRTPF and LST [21] are nearly the same; the recovery PSNR at sparsity 47,000 is 37.3 dB.
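The sparsity-rate claim above is simple arithmetic, and the PSNR measure used throughout is standard; both are sketched below (the 8-bit peak value of 255 is our assumption).

```python
import numpy as np

# Each coefficient vector y_i has length 200 with ||y_i||_0 < 5, so the
# sparsity rate is below 5 / 200 = 2.5%; relative to the 100-pixel patch,
# at most 5 / 100 = 5% of the data needs to be stored.
print(5 / 200, 5 / 100)  # 0.025, 0.05

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a recovery."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```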

**Figure 5.** Convergence curve and sparsification error. The *X*-axes are the iteration number; the *Y*-axes are the objective function (**left**) and the sparsification error (**right**), respectively. It can be seen that our DRTPF learning algorithm is convergent and insensitive to initialization.

**Figure 6.** Average sparsity and recovery PSNR. **Left**: the *X*-axis is the iteration number and the *Y*-axis is the average sparsity. **Right**: the *X*-axis is the average sparsity and the *Y*-axis is the average recovery PSNR.

#### *5.3. Image Denoising*

In this subsection, we evaluate the performance of our DRTPF model using six natural images of size 512 × 512, which are shown in Figure 4. We added Gaussian white noise to these images at different noise levels ($\sigma$ = 20, 30, 40, 50, 60). We set the parameters $\eta\_1 = 1.1$ and $\eta\_3 = 10^7$, and $\eta\_2$ was replaced by the $\ell\_0$ thresholding $0.6\sigma$, as before. We compared DRTPF with the three most related sparse representation methods, namely K-SVD [14], the overcomplete transform (T.KSVD) [3], and the learning-based frame (DTF) [33], as well as with BM3D [35] and WNNM [36]. BM3D and WNNM are nonlocal methods, with parameters set as in the corresponding papers. We note that DTF works on filters instead of image patches. In this experiment, the settings of our DRTPF method and K-SVD were the same as in Section 5.1. All methods were trained iteratively (25 iterations). The DTF method was initialized with 64 3-level Haar wavelet filters of size 16 × 16. The operator size of the T.KSVD method was 128 × 64, and it worked on 8 × 8 overlapping mean-subtracted patches. The hard-thresholding sparsity level was *s* = 30.
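A minimal sketch of the noise model used in this subsection (the random seed is illustrative only):

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian white noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return img.astype(float) + sigma * rng.standard_normal(img.shape)

noise_levels = (20, 30, 40, 50, 60)  # the sigma values tested here
```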

Table 1 shows the comparison results in terms of average PSNR. As shown in Table 1, our DRTPF method and the DTF method outperformed K-SVD and T.KSVD on most images; e.g., at noise level $\sigma$ = 60, the proposed DRTPF outperforms K-SVD by 0.47 dB and T.KSVD by 0.76 dB. This result implies that methods using frames are more robust against noise. Furthermore, the higher the noise level, the larger the margin by which the DRTPF and DTF methods outperform K-SVD and T.KSVD. We can also see that our DRTPF method outperformed DTF on most of the images, especially when the noise level was very high. In fact, in our model, the sparse coefficients are calculated accurately by the inner product of the signals and the frame **Ψ**, and are limited to a certain range; theoretically, it should therefore perform better than the compared methods. Figure 7 shows two visual examples on the images 'Boat' and 'Man' at noise level $\sigma$ = 40. The PSNR values of K-SVD, T.KSVD, DTF, and the proposed DRTPF are 27.17 dB, 26.14 dB, 26.99 dB, and 27.34 dB for 'Man', and 27.23 dB, 26.45 dB, 27.20 dB, and 27.39 dB for 'Boat', respectively. Our proposed DRTPF and the DTF method preserve more features and attain higher PSNR values on the two images than K-SVD and T.KSVD. Although DTF provides higher PSNR values than K-SVD and T.KSVD, and better visual performance, its results suffer from deformation and margin smoothing, as it is filter-based. The proposed DRTPF shows much clearer and better visual results than the other competing methods, without any deformation.


**Table 1.** Average PSNR results at different noise levels on six images.

All six methods can be classified into two categories: (1) methods without any extra constraint, e.g., nonlocal similarity; and (2) methods with an additional prior such as nonlocal similarity. Our proposed DRTPF belongs to category (1). We would like to point out that our goal was to establish a redundant transform learning method, not to focus on image denoising. Our model is plain, applying no extra prior besides the basic sparsity characteristics of the signals. The experimental results demonstrate that our proposed models can achieve better performance than traditional sparse models in image denoising. The methods BM3D and WNNM, however, are based on image nonlocal self-similarity (NSS). The NSS prior refers to the fact that, for a given local patch in a natural image, one can find many similar patches across the image. Intuitively, by stacking nonlocal similar patch vectors into a matrix, this matrix should be low-rank and have sparse singular values. The exploitation of NSS has been used to significantly boost image denoising performance; we have not incorporated this prior into our model.
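To make the NSS argument concrete, the following sketch (our illustration, not part of BM3D or WNNM) collects the k patches most similar to a reference patch within a search window and stacks them as columns; the singular values of the resulting group matrix are typically dominated by a few large ones, i.e., the matrix is approximately low-rank.

```python
import numpy as np

def nss_group(img: np.ndarray, i0: int, j0: int,
              p: int = 8, k: int = 32, win: int = 20):
    """Stack the k patches most similar (in squared distance) to the
    reference patch at (i0, j0), searched within a local window."""
    ref = img[i0:i0 + p, j0:j0 + p].reshape(-1).astype(float)
    cands = []
    for i in range(max(0, i0 - win), min(img.shape[0] - p, i0 + win) + 1):
        for j in range(max(0, j0 - win), min(img.shape[1] - p, j0 + win) + 1):
            v = img[i:i + p, j:j + p].reshape(-1).astype(float)
            cands.append((np.sum((v - ref) ** 2), v))
    cands.sort(key=lambda t: t[0])
    G = np.stack([v for _, v in cands[:k]], axis=1)  # p^2 x k group matrix
    return G, np.linalg.svd(G, compute_uv=False)     # near-sparse spectrum
```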

**Figure 7.** Visual comparison of reconstruction results by different methods on 'Man' and 'Boat'. From left to right: original, T.KSVD [3], K-SVD [14], DTF [33], and DRTPF.

#### **6. Conclusions**

In this paper, we proposed a Parseval frame-based data-driven overcomplete transform (DRTPF) to capture features of images. We also proposed the corresponding formulations, as well as algorithms for calculating the sparse coefficients and for learning the DRTPF model. We proposed a general frame learning method that imposes no structure on the frame. By applying frames to redundant transforms, we combine the ideas of the analysis and synthesis sparse models and let them share almost identical sparse coefficients. We conducted robustness analysis, natural image sparsification, and image denoising experiments, which demonstrated that DRTPF can outperform state-of-the-art models, as it exploits the underlying sparsity of natural signals through the integration of frames and sparse models.

In future work, we shall consider more efficient optimization algorithms for DRTPF, which would improve the representation ability and broaden the applications of the proposed method.

**Author Contributions:** M.Z. derived the theory, analyzed the data, performed the performance experiments, and wrote the original draft; Y.S. and N.Q. researched the relevant theory, participated in discussions of the work, and revised the manuscript; B.Y. supervised the project. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by NSFC (Nos. 61672066, 61976011, U1811463, U19B2039, 61632006, 61906008, 61906009), the Beijing Municipal Science and Technology Commission (No. Z171100004417023), and the Common Program of Beijing Municipal Commission of Education (KM202010005018).

**Acknowledgments:** This work was supported by Beijing Advanced Innovation Center for Future Internet Technology, Beijing Key Laboratory of Multimedia, and Intelligent Software Technology. We deeply appreciate the organizations mentioned above.

**Conflicts of Interest:** The authors declare no conflict of interest.
