Article

Hybrid Transformer and Convolution for Image Compressed Sensing

by Ruili Nan, Guiling Sun *, Bowen Zheng and Pengchen Zhang
College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3496; https://doi.org/10.3390/electronics13173496
Submission received: 23 July 2024 / Revised: 9 August 2024 / Accepted: 2 September 2024 / Published: 3 September 2024

Abstract: In recent years, deep unfolding networks (DUNs) have received widespread attention in the field of compressed sensing (CS) reconstruction due to their good interpretability and strong mapping capabilities. However, existing DUNs often improve reconstruction quality at the expense of a large number of parameters, and they suffer from information loss during long-distance feature transmission. To address these problems, we propose an unfolding architecture that mixes Transformer and large kernel convolution to achieve sparse sampling and reconstruction of natural images, namely, a reconstruction network based on Transformer and convolution (TCR-Net). The Transformer framework has the inherent ability to capture global context through its self-attention mechanism, which effectively addresses the challenge of long-range feature dependencies. TCR-Net is an end-to-end two-stage architecture. First, a data-driven pre-trained encoder completes the sparse representation and basic feature extraction of image information. Second, a new attention mechanism is introduced to replace the self-attention in Transformer, and an optimization-inspired hybrid Transformer and convolution module is designed; its iterative process yields the unfolding framework, which approximates the original image stage by stage. Experimental results show that TCR-Net outperforms existing state-of-the-art CS methods while maintaining fast computational speed. Specifically, at a CS ratio of 0.10, the average PSNR on the test sets used in this paper improves by at least 0.8%, the average SSIM improves by at least 1.5%, and the processing speed exceeds 70 FPS. These quantitative results show that our method achieves high computational efficiency while ensuring high-quality image restoration.

1. Introduction

Compressed sensing (CS) is a new paradigm for information acquisition: a sparse signal $x \in \mathbb{R}^N$ can be reconstructed with high probability from measurements whose dimension is much smaller than the original signal dimension N [1]. CS research mainly focuses on two aspects: (1) designing efficient sampling matrices; (2) building high-quality reconstruction solvers to recover high-dimensional original signals from low-dimensional measurements. The focus of this paper is on the latter. The applications of CS technology include but are not limited to wireless sensor networks [2], medical imaging [3], single-pixel imaging [4], etc., because it greatly reduces the transmission bandwidth and storage demands of information acquisition while still allowing high-probability recovery. Mathematically, in the sampling stage, the image $x \in \mathbb{R}^N$ is rapidly sampled to obtain a linear random measurement $y = \Phi x \in \mathbb{R}^M$, where $\Phi \in \mathbb{R}^{M \times N}$ is the measurement matrix, $M \ll N$, and the sampling rate is $M/N$. The reconstruction stage restores the original image x from the low-dimensional measurement y. This inverse problem is clearly underdetermined, so in theory it has infinitely many solutions. To obtain reliable reconstructions, traditional CS methods usually minimize an energy function:
$$\arg\min_{x}\left(\frac{1}{2}\|\Phi x - y\|_2^2 + \alpha F(x)\right). \qquad (1)$$
In Equation (1), $\frac{1}{2}\|\Phi x - y\|_2^2$ is the data fidelity term, which measures the consistency between the measurements of the reconstructed image and the observed measurements y, and $\alpha F(x)$ is the prior term with regularization parameter $\alpha$. Because the inverse problem is underdetermined, traditional CS methods typically choose the prior term as a sparsity operator under some predefined transform basis, such as the wavelet transform or the discrete cosine transform (DCT) [5,6]. In most cases these methods converge reliably and are supported by theoretical analysis, but they commonly suffer from high computational complexity and low adaptability [7]. In recent years, owing to the powerful learning ability of deep neural networks, a series of deep-network-based image CS methods have been proposed [8,9]. These methods relax the sparsity assumptions on the original image and jointly optimize the sampling matrix and the nonlinear recovery operator [10,11], so that the two can be learned and coordinated through end-to-end training, capture the structural and textural features of the image more effectively, and greatly improve the efficiency and quality of image CS reconstruction. Among them, deep unfolding networks (DUNs) have received widespread attention due to their good interpretability and strong mapping capabilities.
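For reference, before turning to deep unfolding, the classical formulation in Equation (1) can be made concrete with a minimal NumPy sketch (not the method proposed in this paper): a Gaussian random matrix stands in for $\Phi$, and ISTA with soft-thresholding under a DCT sparsity prior stands in for the predefined transform bases discussed above.

```python
# Minimal sketch of Equation (1): Gaussian random sampling of a vectorized block,
# then ISTA with DCT-domain soft-thresholding as a simple hand-crafted prior F(x).
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
N, ratio = 1024, 0.10                                  # 32x32 block, CS ratio M/N = 0.10
M = int(ratio * N)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)         # Gaussian random measurement matrix

x = rng.standard_normal(N)                             # stand-in for a vectorized image block
y = Phi @ x                                            # low-dimensional measurements y = Phi x

alpha = 0.01
step = 1.0 / np.linalg.norm(Phi, 2) ** 2               # step size below 1/L for the fidelity term
x_hat = np.zeros(N)
for _ in range(200):
    r = x_hat - step * Phi.T @ (Phi @ x_hat - y)       # gradient step on the data fidelity term
    c = dct(r, norm='ortho')                           # move to the sparsifying (DCT) domain
    c = np.sign(c) * np.maximum(np.abs(c) - step * alpha, 0.0)   # soft-thresholding
    x_hat = idct(c, norm='ortho')                      # back to the image domain
```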
However, existing DUNs are often constrained by their model structure and are prone to losing feature information during iteration. Moreover, beyond the local features captured by a CNN, the global positional information of an image is also important, and it is difficult to learn this global information comprehensively with a plain CNN: stacked convolutional layers lead to over-parameterization and redundant filters. These issues naturally constrain the effective receptive field and can limit reconstruction performance [12,13].
This paper aims to use the inherent ability of the Transformer framework to capture global context to effectively solve the above problems. As the core module of Transformer, self-attention [14] was originally designed for 1D natural language processing (NLP) tasks and treats 2D images as 1D sequences, which destroys the key 2D structure of the image. This raises the question of how to design CS reconstruction models suitable for images to meet the requirements of global correlation modeling.
To solve this problem, this paper proposes an optimization-inspired hybrid Transformer and convolution module (TC) as the iterative unit and establishes a TC-based image CS unfolding framework, shown in Figure 1, to achieve joint optimization of image sparse sampling and reconstruction, namely, a reconstruction network based on Transformer and convolution (TCR-Net). In the TC module, to reduce the feature loss caused by the inherent structure of the unfolded network, an information transmission path is built between adjacent stages; meanwhile, a new dual-channel large kernel attention (Dual-LKA) is designed to replace the original self-attention in Transformer. Dual-LKA absorbs the advantages of convolution and self-attention, including local structural information, long-range dependence, and adaptability, while avoiding self-attention's neglect of channel-wise adaptability. Overall, our TCR-Net has good interpretability. Moreover, it alleviates the common problems of information loss and incomplete global feature acquisition in DUNs, preserving information more completely, which is conducive to more accurate and faster image CS reconstruction.
In summary, our main contributions are summarized as follows:
  • Combining Transformer and large kernel convolution, an optimization-inspired TC module is designed and its iterative process is used to construct a novel image CS unfolding framework, TCR-Net, which realizes the joint optimization of image sparse sampling and reconstruction.
  • In order to extract feature information more completely, an information transfer path is built between neighboring TC modules. Meanwhile, a new dual-channel large kernel attention mechanism (Dual-LKA) is proposed to process contextual information efficiently while retaining local descriptions; it integrates the advantages of convolution and self-attention while avoiding their drawbacks, making TCR-Net better suited to CS reconstruction of images.
  • Extensive experiments demonstrate that our proposed TCR-Net outperforms existing state-of-the-art CS methods while maintaining fast computational speed.
This paper is organized as follows: Section 1 introduces the research background and existing work on image CS and then summarizes our main contributions. Section 2 reviews related work on CS. Section 3 elaborates on the proposed TCR-Net. Section 4 presents comparative experiments between TCR-Net and existing advanced algorithms and analyzes and discusses the results. Section 5 concludes the paper and outlines future work.

2. Related Work

2.1. Deep Unfolding Network

The idea of DUNs is to cascade traditional iterative optimization algorithms through neural networks. DUNs have good interpretability on the training data pairs $\{(y_i, x_i)\}_{i=1}^{N_a}$ and are usually formulated as a bi-level optimization problem in the CS setting:
$$\min_{\Theta}\sum_{i=1}^{N_a}\mathcal{L}(\hat{x}_i, x_i), \quad \text{s.t.}\;\; \hat{x}_i = \arg\min_{x}\left(\frac{1}{2}\|\Phi x - y_i\|_2^2 + \alpha F(x)\right). \qquad (2)$$
In recent years, some optimization methods, such as the iterative shrinkage-thresholding algorithm (ISTA) [15] and approximate message passing (AMP) [16], have been continuously developed into unfolding networks [17,18]. However, from the perspective of the network structure, the inherent transmission path of DUNs is prone to losing feature information, each stage is strongly affected by its adjacent stages, and the ability to capture global context information is poor.

2.2. Transformer

Inspired by the success of Transformers [19] in NLP, researchers began extending the Transformer structure to various computer vision tasks [20], such as segmentation [21], object detection [22], and image restoration [23]. Despite this success, its core module, self-attention, still has shortcomings [24]. Besides destroying the critical 2D structure of the image as mentioned above, processing high-resolution images is difficult because of the quadratic computation and memory overhead. In addition, self-attention adapts only along the spatial dimension and ignores adaptation along the channel dimension, which is also important for visual tasks [25]. Therefore, the practical application of Transformer in visual tasks requires further exploration. Since self-attention is replaceable in visual tasks [26], we design an attention mechanism better suited to fine-grained image restoration, one that accounts for local structural information and long-range dependencies while keeping the network adaptive in both the spatial and channel dimensions.

3. Proposed Method

In this section, we will describe the proposed TCR-Net in detail.

3.1. Overall Architecture

Considering simplicity and interpretability, we follow [27] and directly expand the traditional proximal gradient descent (PGD) [28] to solve Equation (1) and express it as an iterative function, namely, Equation (3):
$$\hat{x}^{(k)} = \arg\min_{x}\left(\frac{1}{2}\left\|x - \left(\hat{x}^{(k-1)} - \lambda^{(k)}\nabla g\left(\hat{x}^{(k-1)}\right)\right)\right\|_2^2 + \alpha F(x)\right), \qquad (3)$$
where $\hat{x}^{(k)}$ denotes the output of the k-th iteration, $g(\cdot)$ denotes the data fidelity term in Equation (1), and $\nabla$ is the gradient operator weighted by the step size $\lambda^{(k)}$.
In practice, inspired by OPINE-Net [17], a DUN unfolded from ISTA, Equation (3) can be split into two sub-problems: gradient descent (GD, Equation (4)) and proximal mapping (PM, Equation (5)), where the PM is in effect a CNN-based denoiser. Different from OPINE-Net, we use a PM with a generalized design: rather than the hand-crafted $\ell_1$ prior of OPINE-Net (where the prior term is the $\ell_1$ norm, whose proximal operator is $\mathrm{prox}_{\alpha,F}(r^{(k)}) = \mathrm{sign}(r^{(k)})\max(0, |r^{(k)}| - \alpha)$ and which biases training toward selecting relatively few features), our PM has wider representational capability and is easily extended to other degradation tasks.
$$r^{(k)} = \hat{x}^{(k-1)} - \lambda^{(k)}\Phi^{T}\left(\Phi\hat{x}^{(k-1)} - y\right), \qquad (4)$$
$$\hat{x}^{(k)} = \mathrm{prox}_{\alpha,F}\left(r^{(k)}\right) = \arg\min_{x}\left(\frac{1}{2}\left\|x - r^{(k)}\right\|_2^2 + \alpha F(x)\right). \qquad (5)$$
The variables $r^{(k)}$ and $\hat{x}^{(k)}$ are updated iteratively until convergence, where $r^{(k)}$ denotes an intermediate variable in the k-th iteration.
Therefore, the iterative process of TCR-Net can be briefly expressed as Equation (6), with $k \in \{1, 2, \ldots, N_s\}$, where $N_s$ denotes the number of TC modules, i.e., the number of network stages.
$$\hat{x}^{(k)} = H_{PM}^{(k)}\left(\hat{x}^{(k-1)} - \lambda^{(k)}\Phi^{T}\left(\Phi\hat{x}^{(k-1)} - y\right)\right). \qquad (6)$$
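The two sub-problems above translate directly into one stage of an unfolded network. The following PyTorch sketch is only illustrative: the learnable step size and a small residual CNN stand in for $\lambda^{(k)}$ and $H_{PM}^{(k)}$, whereas the actual PM used in TCR-Net is the TC module described in Section 3.2.

```python
# One unfolded PGD stage, Equations (4)-(6): gradient descent on the fidelity term
# followed by a learned proximal mapping (here a placeholder residual CNN).
import torch
import torch.nn as nn

class PGDStage(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))          # learnable step size lambda^(k)
        self.prox = nn.Sequential(                            # placeholder for H_PM^(k)
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x_hat, y, sample, transpose):
        # sample(.)   : x -> Phi x      (block-wise measurement operator)
        # transpose(.): y -> Phi^T y    (its adjoint, back to image space)
        r = x_hat - self.step * transpose(sample(x_hat) - y)  # Equation (4), GD step
        return r + self.prox(r)                                # Equation (5), residual PM
```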

3.2. Architecture Design of TCR-Net

In this section, we will elaborate on the structural design and related theories of TCR-Net. As shown in Figure 1, the network is divided into two stages. The first is adaptive sparse sampling of images to obtain measurements; the second is inverse problem solving to achieve end-to-end inverse mapping of measurements to original images, including initial reconstruction (IR) and deep reconstruction (DR).

3.2.1. Sampling and Initial Reconstruction

In order to obtain better image reconstruction results, a data-driven pre-trained encoder is used to complete the sparse representation and basic feature extraction of image information. Specifically, we first divide the image X into non-overlapping blocks $\{x_i\}$ of size $C \times B \times B$, where B denotes the block size and C the number of image channels; in this paper, C = 1 by default. To avoid destroying the 2D structure of the image, the sampling is implemented with bias-free convolution. Assuming the measurement matrix is $\Phi \in \mathbb{R}^{M \times N}$ ($M \leq N$), M convolution kernels of size $1 \times \sqrt{N} \times \sqrt{N}$ are convolved with $\{x_i\}$ (abbreviated as x). This achieves sparse sampling of the image and yields the measurements $\{y_i\}$ (abbreviated as y). Sampling proceeds block by block with the convolution stride set to P, i.e., $P = \sqrt{N}$; in general, B is an integer multiple of P. Denoting the sampling module as $S(\cdot)$, we have $y = S(x)$.
$\Phi^{T}$ is used to perform the IR of the image so that no additional parameters need to be introduced. That is, $\Phi$ is transposed into N convolution kernels of size $M \times 1 \times 1$ and convolved with y with a stride of 1 to obtain the initial reconstruction $\hat{x}^{(0)}$. Denoting the IR module as $I(\cdot)$, we have $\hat{x}^{(0)} = I(S(x))$.
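The sampling and initial reconstruction described above can be sketched with standard PyTorch layers as follows; this is an illustrative rendering under the settings of Section 4.1.2 (P = 32, N = P² = 1024), not the authors' released implementation.

```python
# Bias-free PxP convolution with stride P as the sampling operator S(.), and a 1x1
# convolution with N output channels followed by PixelShuffle(P) as Phi^T for IR.
import math
import torch
import torch.nn as nn

class SamplingAndIR(nn.Module):
    def __init__(self, gamma=0.10, P=32):
        super().__init__()
        N = P * P
        M = math.ceil(gamma * N)
        self.sample = nn.Conv2d(1, M, kernel_size=P, stride=P, bias=False)  # y = S(x)
        self.init_rec = nn.Conv2d(M, N, kernel_size=1, bias=False)          # Phi^T applied to y
        self.fold = nn.PixelShuffle(P)                                       # fold N channels back to a block

    def forward(self, x):                        # x: (batch, 1, H, W), H and W multiples of P
        y = self.sample(x)                       # measurements, shape (batch, M, H/P, W/P)
        x0 = self.fold(self.init_rec(y))         # initial reconstruction x^(0), shape (batch, 1, H, W)
        return y, x0
```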

3.2.2. Deep Reconstruction

The DR stage iterates the TC module $N_s$ times. The structure of the k-th TC module is shown in Figure 2. Each TC module contains GD and PM. The k-th TC module takes $\hat{x}^{(k-1)}$ and $z^{(k-1)}$ as inputs and outputs $\hat{x}^{(k)}$ and $z^{(k)}$; the two tensors are updated iteratively. Constrained by the structure of DUNs, the input and output of each stage are single-channel images, and the $3 \times 3$ convolution in PM that converts multi-channel features back to a single channel is a lossy conversion, which leads to the loss of image details. Therefore, by passing the output $z^{(k)}$ of the feed-forward neural network directly to the next stage, an information transmission path is built between adjacent stages, which effectively avoids the information loss caused by the channel-shrinking conversion.
The symmetry of ξ ( · ) and ζ ( · ) helps the model better understand and utilize the input symmetry information. Since the sparsity or structure of information in compressed sensing is often related to symmetry, this design helps information transfer and feature extraction.
The attention module $A(\cdot)$ is the core of the TC module design. Transformer uses the self-attention mechanism to capture long-range dependencies, and it plays an increasingly important role in computer vision [29,30]. However, as described in Section 2, it has obvious shortcomings that cannot be ignored. The attention mechanism can be viewed as an adaptive selection process, which selects discriminative features from the input and automatically suppresses noisy responses; its key steps are generating an attention map that represents the importance of different parts and learning the relationships between different features. Large kernel convolutions [31,32] can also be used to build such correlations and generate attention maps, but this approach has its own obvious drawback: large kernels bring considerable computational overhead and parameters. In order to overcome these shortcomings while retaining the advantages of self-attention and large kernel convolution, we utilize decomposable large kernel convolution to capture long-range relationships.
Guo et al. [33] showed through detailed experiments that a large kernel convolution can be effectively decomposed into a combination of three convolutions: a depth-wise convolution, a depth-wise dilation convolution, and a $1 \times 1$ convolution. At the same time, an attention mechanism is involved in the decomposition; in other words, decomposable large kernel convolution provides a receptive field similar to that of the self-attention mechanism. Through this decomposition, we can capture long-range relationships with small computational cost and few parameters, estimate the importance of individual positions, and generate the corresponding attention maps.
Specifically, we decompose a $K \times K$ convolution into a $(2d-1) \times (2d-1)$ depth-wise convolution, a $\lceil K/d \rceil \times \lceil K/d \rceil$ depth-wise dilation convolution with dilation d, and a $1 \times 1$ convolution, where K is the size of the convolution kernel. The number of parameters $P(K, d)$ and the number of floating-point operations (FLOPs) $F(K, d)$ for an input of spatial size $H \times W$ with C channels are then:
$$P(K, d) = \left((2d-1)^2 + \lceil K/d \rceil^2 + C + 3\right) \cdot C, \qquad (7)$$
$$F(K, d) = P(K, d) \times H \times W. \qquad (8)$$
From Equations (7) and (8), $P(K, d)$ grows quadratically with K and C, and $F(K, d)$ grows linearly with $P(K, d)$ and the image size. Once the reconstruction target is fixed, H, W, and C are fixed, so the computational cost can be reduced by reducing $P(K, d)$. Therefore, to minimize $P(K, d)$ for a fixed kernel size K and reduce the network's computational cost, the derivative of Equation (7) with respect to d is set to zero, with $\lceil K/d \rceil$ approximated by $K/d$:
$$\frac{\mathrm{d}P(K, d)}{\mathrm{d}d} = \left(8d - \frac{2K^2}{d^3} - 4\right) \cdot C \overset{!}{=} 0. \qquad (9)$$
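Since C cancels in Equation (9), the optimal d depends only on K and can be checked numerically; the small script below reproduces the values used in the next paragraph.

```python
# Numerical root of dP(K, d)/dd = 0, i.e. 8d - 4 - 2K^2/d^3 = 0, for the two kernel sizes.
from scipy.optimize import brentq

for K in (9, 27):
    d_opt = brentq(lambda d: 8 * d - 4 - 2 * K**2 / d**3, 1.0, 10.0)
    print(f"K = {K:2d}: optimal d = {d_opt:.2f}")
# K =  9: optimal d = 2.26  (rounded to d = 2)
# K = 27: optimal d = 3.81  (rounded to d = 4)
```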
In order to give $A(\cdot)$ richer multi-scale information, a multi-scale mechanism is introduced and a Dual-LKA is designed; its structure is shown in Figure 3. After the input features pass through a $1 \times 1$ convolution and a GELU activation function, two large kernel decompositions of different sizes are applied, with K = 9 and K = 27, respectively. According to the solution of Equation (9), $d \approx 2.3$ when K = 9 and $d \approx 3.8$ when K = 27, so d is set to 2 and 4, respectively. The corresponding kernel parameters are therefore (3, 5, 1) and (7, 7, 1).
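An illustrative PyTorch sketch of the Dual-LKA attention $A(\cdot)$ is given below: each branch is a decomposed large kernel (K = 9 with d = 2, K = 27 with d = 4) built from a depth-wise convolution, a depth-wise dilation convolution, and a $1 \times 1$ convolution, and the combined response re-weights the input features. The additive fusion of the two branches is our assumption; the exact layout is defined by Figure 3.

```python
import torch
import torch.nn as nn

class LKABranch(nn.Module):
    """One decomposed large-kernel branch: depth-wise conv + dilated depth-wise conv + 1x1 conv."""
    def __init__(self, channels, dw_k, dil_k, dilation):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, dw_k, padding=dw_k // 2, groups=channels)
        self.dw_dil = nn.Conv2d(channels, channels, dil_k, dilation=dilation,
                                padding=(dil_k // 2) * dilation, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw(self.dw_dil(self.dw(x)))

class DualLKA(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, channels, 1)
        self.act = nn.GELU()
        self.branch_k9 = LKABranch(channels, dw_k=3, dil_k=5, dilation=2)    # (3, 5, 1), K = 9
        self.branch_k27 = LKABranch(channels, dw_k=7, dil_k=7, dilation=4)   # (7, 7, 1), K = 27
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        u = self.act(self.proj_in(x))
        attn = self.branch_k9(u) + self.branch_k27(u)    # assumed additive fusion of the two scales
        return self.proj_out(attn * u)                   # attention map re-weights the features
```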
Expressing the TC module as D ( · ) and the convolution operation as C ( · ) , the iterative process of the DR network can be expressed as:
$$\hat{x}^{(k)}, z^{(k)} = D\left(\hat{x}^{(k-1)}, z^{(k-1)}\right), \qquad (10)$$
where $z^{(0)} = C(\hat{x}^{(0)})$. After $N_s$ iterations, the final reconstructed image $\hat{x}^{(N_s)}$ is obtained.
In summary, the complete implementation process of TCR-Net is summarized in Algorithm 1, where $\|$ denotes concatenation along the channel dimension and $C_{k \times k}(\cdot)$ denotes a $k \times k$ convolution.
Algorithm 1: TCR-Net for Image Compressed Sensing.
  • Input: x; initialize the iteration index k = 1 and its ceiling $N_s$; learnable $\Phi$ and $\lambda^{(k)}$
  • Output: $\hat{x}^{(N_s)}$
  • Adaptive Sampling: $y = S(x)$
  • Reconstruction:
  • Initialization: $\hat{x}^{(0)} = I(S(x))$, $z^{(0)} = C_{3 \times 3}(\hat{x}^{(0)})$
  • for $k = 1, \ldots, N_s$ do
  •      $r^{(k)} = \hat{x}^{(k-1)} - \lambda^{(k)} \Phi^{T}(\Phi \hat{x}^{(k-1)} - y)$
  •      $b^{(k)} = \xi(C_{3 \times 3}(r^{(k)} \| z^{(k-1)})) + C_{3 \times 3}(r^{(k)} \| z^{(k-1)})$
  •      $c^{(k)} = A(\mathrm{LN}(b^{(k)})) + b^{(k)}$
  •      $z^{(k)} = \zeta(\mathrm{LN}(c^{(k)})) + c^{(k)}$
  •      $\hat{x}^{(k)} = C_{3 \times 3}(z^{(k)}) + r^{(k)}$
  • Return $\hat{x}^{(N_s)}$
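For illustration, the per-stage updates of Algorithm 1 can be rendered in PyTorch as follows. The sub-modules $\xi(\cdot)$, $A(\cdot)$ (e.g., the Dual-LKA sketched above), and $\zeta(\cdot)$ are passed in, a channel-wise GroupNorm stands in for $\mathrm{LN}(\cdot)$, and the layer shapes follow Section 3.3; this is our reading of Algorithm 1 rather than the authors' code.

```python
import torch
import torch.nn as nn

class TCModule(nn.Module):
    def __init__(self, xi, attn, zeta, channels=32):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))                   # lambda^(k)
        self.fuse = nn.Conv2d(channels + 1, channels, 3, padding=1)   # C_3x3 applied to r || z
        self.out = nn.Conv2d(channels, 1, 3, padding=1)               # C_3x3 back to one channel
        self.norm1 = nn.GroupNorm(1, channels)                        # stand-in for LN(.)
        self.norm2 = nn.GroupNorm(1, channels)
        self.xi, self.attn, self.zeta = xi, attn, zeta

    def forward(self, x_hat, z, y, sample, transpose):
        r = x_hat - self.step * transpose(sample(x_hat) - y)   # gradient descent step
        f = self.fuse(torch.cat([r, z], dim=1))                # C_3x3(r || z)
        b = self.xi(f) + f                                      # b^(k)
        c = self.attn(self.norm1(b)) + b                        # c^(k)
        z_next = self.zeta(self.norm2(c)) + c                   # z^(k), forwarded to the next stage
        x_next = self.out(z_next) + r                           # x_hat^(k)
        return x_next, z_next
```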

3.3. Loss Function and Network Parameters

Given the training dataset $\{x_i\}$ and the CS ratio $\gamma$, each $x_i$ is first fed into TCR-Net to complete adaptive sampling and reconstruction in sequence and to output the final restored image $\hat{x}_i^{(N_s)}$; the MSE between $x_i$ and $\hat{x}_i^{(N_s)}$ is then used as the loss ($N_s$ is the number of DR phases):
$$\mathcal{L} = \left\|\hat{x}_i^{(N_s)} - x_i\right\|_2^2. \qquad (11)$$
The TCR-Net proposed in this article is an end-to-end mapping network that can be learned throughout the entire process. All involved parameters (such as measurement matrices, nonlinear transformations, etc.) are learned using end-to-end backpropagation, which has the advantage of fast and accurate reconstruction performance and explicit interpretability.
Specifically, the learnable parameter set $\Theta$ in TCR-Net includes $\Phi$, $\lambda$, $\xi(\cdot)$, $A(\cdot)$, $\zeta(\cdot)$, and several convolution modules $C_{k \times k}(\cdot)$ at different scales, i.e., $\Theta = \{\Phi, \lambda, \xi(\cdot), A(\cdot), \zeta(\cdot), C_{k \times k}(\cdot)\}$. Note that the same parameters are shared by all TC modules except $\lambda$.
In TC, the first $C_{3 \times 3}(\cdot)$ has C + 1 input channels and C output channels, and the last $C_{3 \times 3}(\cdot)$ has C input channels and one output channel. After $\xi(\cdot)$, $A(\cdot)$, and $\zeta(\cdot)$, the number of channels remains C. The $C_{3 \times 3}(\cdot)$ in $z^{(0)} = C_{3 \times 3}(\hat{x}^{(0)})$ has one input channel and C output channels.

4. Experiment

4.1. Experimental Settings

4.1.1. Datasets and Performance Measures

Our training dataset uses the train (200 images) and test (200 images) splits of BSD500 [34], and the validation dataset is Set11 [35]. Each training image is randomly cropped into 200 sub-images of size 96 × 96, so the training set contains 80,000 sub-images in total. During testing, the optimal model is selected and evaluated on four widely used benchmark datasets: Set5 [36], McM18 [37], BSD100 [34], and General100 [38].
In order to measure the performance of each algorithm, two commonly used metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to comprehensively evaluate the quality of image reconstruction. PSNR represents the peak signal-to-noise ratio between the reconstructed image and the original image, which is used to quantify the image quality. SSIM evaluates the structural similarity between the reconstructed image and the original image, reflecting the visual quality of the image. We also show parameters, model size, and average computational time of each method to measure their performance.
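For completeness, PSNR is computed as below for 8-bit grayscale images, while SSIM is typically taken from a library implementation such as scikit-image rather than re-implemented; the function names here are illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref, rec, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)          # peak signal-to-noise ratio in dB

def ssim(ref, rec):
    return structural_similarity(ref, rec, data_range=255)   # evaluated on the Y channel
```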

4.1.2. Implementation Details

Adaptive sampling and reconstruction are performed at different CS ratios $\gamma = \{0.01, 0.04, 0.10, 0.25, 0.50\}$, each with a corresponding adaptive measurement matrix $\Phi \in \mathbb{R}^{M \times N}$. In TCR-Net, B = 96 and P = 32, so N = 1024 and M = $\gamma \cdot N$ in $\Phi$. Training was conducted on an RTX 4090 (24 GB) GPU with Python and PyTorch 1.11.0. All tests and ablation studies were conducted on an Intel Xeon(R) W-2145 CPU with an NVIDIA Quadro RTX 4000 GPU. TCR-Net is trained for 100 epochs with a batch size of 16, and the number of feature-map channels C is 32. We use the Adam [39] optimizer with an initial learning rate of 4 × 10⁻⁵, adjusted to 5 × 10⁻⁵ over the 100 epochs using the cosine annealing strategy [40], with 3 warm-up epochs. Note that color images are processed in the YCbCr space and evaluated on the Y channel. Considering device performance, we set the number of stages $N_s$ to 7 for a better trade-off between model performance and complexity.
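A sketch of this training setup is given below, assuming a `model` that maps an image batch to its reconstruction and a `loader` of 96 × 96 sub-image batches; the learning-rate values are taken verbatim from the text, and since the handling of the 3 warm-up epochs is not specified there, the linear warm-up shown is purely illustrative.

```python
import torch

def train(model, loader, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=5e-5)
    for epoch in range(epochs):
        warm = min(1.0, (epoch + 1) / 3)                    # illustrative linear warm-up over 3 epochs
        for group in optimizer.param_groups:
            group['lr'] = scheduler.get_last_lr()[0] * warm
        for x in loader:                                    # x: (16, 1, 96, 96) training sub-images
            optimizer.zero_grad()
            x_rec = model(x)                                # adaptive sampling + reconstruction
            loss = torch.mean((x_rec - x) ** 2)             # MSE loss of Equation (11)
            loss.backward()
            optimizer.step()
        scheduler.step()                                    # cosine annealing of the learning rate
```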

4.2. Comparisons with State-of-the-Art Methods

In order to evaluate the performance of the proposed TCR-Net, it is compared with existing representative CS methods (including ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], TCS-Net [44]) in terms of reconstruction quality and algorithm complexity. These comparison algorithms all belong to DUNs. ISTA-Net+ uses Gaussian random matrix (GRM), and the other algorithms all use learnable measurement matrices. TransCS and TCS-Net both introduce the Transformer structure.
Table 1 shows the experimental comparison on these datasets at multiple CS ratios. It can be observed from Table 1 that, in all cases, our TCR-Net outperforms all other competing methods in terms of PSNR and SSIM. Avg. in Table 1 denotes the average reconstruction quality over all datasets at a given CS ratio for each algorithm, computed as follows:
$$p = \sum_{i=1}^{D} p_i \cdot n_i \Big/ \sum_{i=1}^{D} n_i, \qquad s = \sum_{i=1}^{D} s_i \cdot n_i \Big/ \sum_{i=1}^{D} n_i. \qquad (12)$$
In Equation (12), $p_i$ and $s_i$ represent the PSNR and SSIM of the i-th dataset, respectively, $n_i$ is the number of images in the i-th dataset, and D is the number of datasets (D = 4 in our paper). Therefore, p and s represent the average PSNR and SSIM over the four datasets at a given CS ratio for a given algorithm. For example, when the CS ratio is 0.10, TCR-Net outperforms ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], and TCS-Net [44] by 3.6357 dB, 0.6118 dB, 0.8164 dB, 0.2457 dB, 0.7604 dB, and 1.3982 dB in terms of PSNR, respectively. Figure 4 further shows the visual comparison at a CS ratio of 0.10; our TCR-Net produces clearer and more accurate reconstructions than the others.
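As a concrete check of Equation (12), the Avg. PSNR of TCR-Net at a CS ratio of 0.10 can be recovered from its per-dataset values in Table 1, using the dataset sizes (Set5: 5, McM18: 18, BSD100: 100, General100: 100 images) as the weights $n_i$:

```python
sizes = [5, 18, 100, 100]                             # n_i for Set5, McM18, BSD100, General100
psnr_per_set = [33.1097, 32.5369, 28.0507, 32.7478]   # TCR-Net PSNR at CS ratio 0.10 (Table 1)
avg = sum(p * n for p, n in zip(psnr_per_set, sizes)) / sum(sizes)
print(f"{avg:.4f}")                                   # 30.6326, the Avg. entry for TCR-Net
```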
For algorithm complexity, Figure 5 shows the real image reconstruction performance of each algorithm under different parameter capacities on Set5 [36]. Meanwhile, Table 2 shows the comparison of parameters, model sizes, and computational time for reconstructing a 256 × 256 image of different CS methods at a CS ratio of 0.10. Time includes the sampling and reconstruction process.
Moreover, TCR-Net takes an average of 14.3 milliseconds (ms) to reconstruct a 1920 × 1080 single-channel image on the GPU; that is, its reconstruction speed is about 70 FPS, which exceeds the 60 FPS frame rate of high-definition video. This makes it well suited to processing tasks with limited resources and tight time budgets. In summary, TCR-Net is an effective CS image restoration method with high reconstruction quality and good stability.

4.3. Ablation Studies and Discussions

4.3.1. Measurement Matrix

TCR-Net utilizes a learnable measurement matrix to achieve adaptive sampling; specifically, a data-driven pre-trained encoder is used to complete the sparse representation of an image and the extraction of basic features. Figure 6 shows the results of image CS reconstruction achieved by using the learnable measurement matrix and GRM, respectively, at a CS ratio of 0.10, and we can find that at least a 1.67 dB gain is obtained by using the learnable measurement matrix compared to the GRM, and the image quality is greatly improved. Meanwhile, Figure 7 shows a simple visualization of the two measurement matrices in the frequency domain, where the learned measurement matrix can adaptively assist and balance the amount of low-frequency and high-frequency information retained in the sensed measurements for better image reconstruction.

4.3.2. Dual-LKA

In this subsection, in order to show that our Dual-LKA design is reasonable, comparative experiments between Dual-LKA and Single-LKA are conducted. The two branches of Dual-LKA use the parameters (7, 7, 1) and (3, 5, 1), so Single-LKA is set to (7, 7, 1) and (3, 5, 1), respectively. Table 3 reports the PSNR of image CS reconstruction on each dataset at a sampling rate of 0.25; the average PSNR is improved by 0.0994 dB and 0.1477 dB over the two Single-LKA variants, respectively. Figure 8 shows the visual comparison on baby_GT from Set5: the reconstructed image (around the eyelash roots) is clearer and more accurate when Dual-LKA is used. In summary, Dual-LKA genuinely improves the quality of image reconstruction.

4.3.3. Sensitivity to Noise

In practice, imaging models may be affected by noise, so to test the robustness of the designed TCR-Net, we first add Gaussian noise of different levels to the images of Set5 and BSD100. The noisy images are then fed to each model, and Set5 and BSD100 are sampled and recovered at a CS ratio of 0.25. Figure 9 plots the PSNR of all methods against the standard deviation of the added zero-mean Gaussian noise. It can be seen that our TCR-Net is robust to noise corruption.
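The noisy-input evaluation described above can be sketched as follows; the clamping range assumes images normalized to [0, 1], and the function and variable names are illustrative rather than taken from the released code.

```python
import torch

def evaluate_under_noise(model, image, sigma):
    noisy = image + sigma * torch.randn_like(image)     # zero-mean Gaussian noise, std = sigma
    noisy = noisy.clamp(0.0, 1.0)                       # assumes intensities normalized to [0, 1]
    with torch.no_grad():
        rec = model(noisy)                              # sample and reconstruct the noisy input at ratio 0.25
    mse = torch.mean((rec - image) ** 2)
    return 10.0 * torch.log10(1.0 / mse)                # PSNR against the clean reference
```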

5. Conclusions

This paper proposes TCR-Net, a fully end-to-end learnable image CS framework that exploits the interpretability and strong mapping capability of DUNs to achieve high-quality, low-latency image reconstruction. TCR-Net jointly optimizes adaptive sparse sampling and reconstruction of natural images. We design a Dual-LKA based on the Transformer structure and large kernel decomposition convolution, which processes contextual information effectively while focusing on local features. On this basis, an optimization-inspired TC module is constructed, and an information transmission path is built between adjacent TC modules, successfully reducing the feature loss caused by channel conversion. Experimental results show that, compared with other mainstream CS methods, TCR-Net delivers higher reconstruction performance and better perceptual quality. In the future, TCR-Net can be adapted and applied to MRI, satellite imaging, and video surveillance systems.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, writing—review and editing, R.N.; funding acquisition, project administration, resources, G.S.; writing—review and editing, validation, formal analysis, B.Z.; writing—review and editing, validation, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tianjin Natural Science Foundation grant number 21JCZDJC00340 and National Natural Science Foundation of China grant number 61771262, 61901233.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
  2. Tian, X.; Wei, G.; Wang, J. Target location method based on compressed sensing in hidden semi Markov model. Electronics 2022, 11, 1715. [Google Scholar] [CrossRef]
  3. Fei, T.; Feng, X. Learning sampling and reconstruction using Bregman iteration for CS-MRI. Electronics 2023, 12, 4657. [Google Scholar] [CrossRef]
  4. Abedi, M.; Sun, B.; Zheng, Z. Single-pixel compressive imaging based on random DoG filtering. Signal Process. 2021, 178, 107746. [Google Scholar] [CrossRef]
  5. Zhao, C.; Ma, S.; Zhang, J.; Xiong, R.; Gao, W. Video compressive sensing reconstruction via reweighted residual sparsity. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1182–1195. [Google Scholar] [CrossRef]
  6. Zhao, C.; Ma, S.; Gao, W. Image compressive-sensing recovery using structured laplacian sparsity in DCT domain and multi-hypothesis prediction. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 14–18 July 2014. [Google Scholar] [CrossRef]
  7. Zhao, C.; Zhang, J.; Ma, S.; Gao, W. Non-convex Lp Nuclear Norm based ADMM Framework for Compressed Sensing. In Proceedings of the 2016 Data Compression Conference (DCC), Snowbird, UT, USA, 30 March–1 April 2016; pp. 161–170. [Google Scholar] [CrossRef]
  8. Wang, B.; Lian, Y.; Xiong, X.; Zhou, H.; Liu, Z.; Das, M. Progressive feature reconstruction and fusion to accelerate MRI imaging: Exploring insights across low, mid, and high-order dimensions. Electronics 2023, 12, 4742. [Google Scholar] [CrossRef]
  9. Xie, Y.; Li, Q. A review of deep learning methods for compressed sensing image reconstruction and its medical applications. Electronics 2022, 11, 586. [Google Scholar] [CrossRef]
  10. Shi, W.; Jiang, F.; Liu, S.; Zhao, D. Scalable convolutional neural network for image compressed sensing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12282–12291. [Google Scholar] [CrossRef]
  11. You, D.; Zhang, J.; Xie, J.; Chen, B.; Ma, S. COAST: Controllable arbitrary-sampling network for compressive sensing. IEEE Trans. Image Process. 2021, 30, 6066–6080. [Google Scholar] [CrossRef]
  12. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4905–4913. [Google Scholar]
  13. Prakash, A.; Storer, J.; Florencio, D.; Zhang, C. RePr: Improved training of convolutional filters. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10658–10667. [Google Scholar] [CrossRef]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929v2. [Google Scholar]
  15. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627. [Google Scholar] [CrossRef]
  16. Donoho, D.L.; Maleki, A.; Montanari, A. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 2009, 106, 18914–18919. [Google Scholar] [CrossRef]
  17. Zhang, J.; Zhao, C.; Gao, W. Optimization-inspired compact deep compressive sensing. IEEE J. Sel. Top. Signal Process. 2020, 14, 765–774. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Liu, Y.; Liu, J.; Wen, F.; Zhu, C. AMP-Net: Denoising-based deep unfolding for compressive image sensing. IEEE Trans. Image Process. 2021, 30, 1487–1500. [Google Scholar] [CrossRef] [PubMed]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5999–6099. [Google Scholar]
  20. Zhang, J.; Huang, Y.; Wu, W.; Lyu, M.R. Transferable adversarial attacks on vision transformers with token gradient regularization. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16415–16424. [Google Scholar] [CrossRef]
  21. Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 17378–17389. [Google Scholar] [CrossRef]
  22. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  23. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped Transformer for image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17662–17672. [Google Scholar] [CrossRef]
  24. Zhang, J.; Huang, J.T.; Wang, W.; Li, Y.; Wu, W.; Wang, X.; Su, Y.; Lyu, M.R. Improving the transferability of adversarial samples by path-augmented method. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8173–8182. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  26. Liu, H.; Dai, Z.; So, D.R.; Le, Q.V. Pay attention to MLPs. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 9204–9215. [Google Scholar]
  27. Lefkimmiatis, S. Universal denoising networks: A novel CNN architecture for image denoising. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3204–3213. [Google Scholar] [CrossRef]
  28. Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]
  29. Xu, Y.; Wei, H.; Lin, M.; Deng, Y.; Sheng, K.; Zhang, M.; Tang, F.; Dong, W.; Huang, F.; Xu, C. Transformers in computational visual media: A survey. Comput. Vis. Media 2022, 8, 33–62. [Google Scholar] [CrossRef]
  30. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar] [CrossRef]
  31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  32. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  33. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  34. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 898–916. [Google Scholar] [CrossRef]
  35. Kulkarni, K.; Lohit, S.; Turaga, P.; Kerviche, R.; Ashok, A. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 449–458. [Google Scholar] [CrossRef]
  36. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012; p. 135. [Google Scholar] [CrossRef]
  37. Zhang, L.; Wu, X.; Buades, A.; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imag. 2011, 20, 023016. [Google Scholar] [CrossRef]
  38. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar] [CrossRef]
  39. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  40. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  41. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1828–1837. [Google Scholar] [CrossRef]
  42. Mou, C.; Wang, Q.; Zhang, J. Deep generalized unfolding networks for image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17378–17389. [Google Scholar] [CrossRef]
  43. Shen, M.; Gan, H.; Ning, C.; Hua, Y.; Zhang, T. TransCS: A Transformer-based hybrid architecture for image compressed sensing. IEEE Trans. Image Process. 2022, 31, 6991–7005. [Google Scholar] [CrossRef] [PubMed]
  44. Gan, H.; Shen, M.; Hua, Y.; Ma, C.; Zhang, T. From patch to pixel: A Transformer-based hierarchical framework for compressive image sensing. IEEE Trans. Comput. Imaging 2023, 9, 133–146. [Google Scholar] [CrossRef]
Figure 1. Illustration of our TCR-Net framework, which contains adaptive compressed sampling and inverse mapping. The inverse mapping consists of two steps: initial reconstruction (IR) and deep reconstruction (DR). In the block-based CS scheme, the original image X is divided into l non-overlapping B × B blocks $\{x_i\}$, which are sampled to obtain block measurements $\{y_i\}$; these are initialized by IR and then refined by DR to obtain the recovered estimate of X. $X^{(k-1)}$ and $Z^{(k-1)}$ are the inputs of the k-th iterative process.
Figure 2. Illustration of the k-th TC module, which contains a gradient descent (GD) module and a proximal mapping (PM) module. The PM module includes the feature extraction module $\xi(\cdot)$, the attention module $A(\cdot)$, and the feed-forward network module $\zeta(\cdot)$. d denotes depth-wise convolution, and k × k denotes a k × k convolution.
Figure 3. Illustration of Dual-LKA. d denotes depth-wise convolution, dd denotes depth-wise dilation convolution, and k × k denotes a k × k convolution. $A(\cdot)$ does not change the number of channels.
Figure 4. Visual comparison of all the competing CS algorithms at a CS ratio of 0.10.
Figure 5. Real image reconstruction performance (y-axis) of our TCR-Net and some recent methods (ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], TCS-Net [44]) under different parameter capacities (x-axis) on Set5 [36].
Figure 6. Reconstruction performance (PSNR/SSIM) with different measurement matrices at a CS ratio of 0.10.
Figure 7. The visualizations of the measurement matrix at a CS ratio of 0.10 in the frequency domain.
Figure 8. Visual comparison on baby_GT.
Figure 9. Comparison of robustness to Gaussian noise.
Table 1. Average PSNR (dB) and SSIM performance comparisons on datasets at multiple CS ratios.
Dataset | CS ratio | ISTA-Net+ | OPINE-Net+ | AMP-Net-BM | DGU-Net+ | TransCS | TCS-Net | TCR-Net
Set5 | 0.01 | 18.5225/0.4408 | 21.8914/0.6101 | 22.4254/0.6185 | 22.4190/0.6237 | -/- | 22.7494/0.6003 | 23.0929/0.6367
Set5 | 0.04 | 23.4528/0.6619 | 27.9457/0.8209 | 27.8246/0.8179 | 28.3861/0.8318 | 27.9142/0.8262 | 27.5483/0.8173 | 28.5561/0.8427
Set5 | 0.10 | 28.6065/0.8315 | 32.5102/0.9058 | 32.1392/0.9031 | 32.8441/0.9111 | 32.2531/0.9138 | 31.4809/0.9067 | 33.1097/0.9244
Set5 | 0.25 | 34.1672/0.9272 | 36.7785/0.9510 | 36.9258/0.9541 | 37.3302/0.9558 | 36.9154/0.9594 | 35.8560/0.9559 | 37.6505/0.9630
Set5 | 0.50 | 39.4886/0.9706 | 41.6234/0.9779 | 42.1352/0.9804 | 42.4728/0.9809 | 42.1788/0.9825 | -/- | 42.7052/0.9842
McM18 | 0.01 | 19.9893/0.4942 | 23.4088/0.6316 | 23.7917/0.6431 | 23.0500/0.6372 | -/- | 23.6266/0.6144 | 24.0858/0.6427
McM18 | 0.04 | 24.2732/0.6577 | 27.9489/0.7891 | 27.9164/0.7887 | 28.1609/0.7998 | 28.0115/0.7943 | 27.5373/0.7907 | 28.3934/0.8126
McM18 | 0.10 | 28.5360/0.8104 | 31.9249/0.8878 | 31.7231/0.8869 | 32.3243/0.8977 | 31.8816/0.8964 | 30.9669/0.8913 | 32.5369/0.9103
McM18 | 0.25 | 33.9880/0.9237 | 36.9213/0.9537 | 37.0400/0.9570 | 37.7359/0.9614 | 37.1519/0.9605 | 35.8945/0.9579 | 37.9561/0.9668
McM18 | 0.50 | 39.5162/0.9728 | 42.2930/0.9834 | 43.0616/0.9866 | 43.6171/0.9875 | 42.9903/0.9872 | -/- | 43.8347/0.9894
BSD100 | 0.01 | 19.1925/0.4056 | 21.8885/0.5103 | 22.2577/0.5225 | 22.1172/0.5126 | -/- | 22.22/0.5029 | 22.7562/0.5248
BSD100 | 0.04 | 22.2471/0.5411 | 25.0002/0.6516 | 25.0882/0.6593 | 25.2757/0.6653 | 25.0518/0.6690 | 24.9076/0.6665 | 25.4058/0.6839
BSD100 | 0.10 | 25.0920/0.6841 | 27.5465/0.7715 | 27.6100/0.7786 | 27.8919/0.7847 | 27.5527/0.7937 | 27.1727/0.7903 | 28.0507/0.8062
BSD100 | 0.25 | 29.0354/0.8402 | 31.2010/0.8870 | 31.4876/0.8974 | 31.6787/0.8984 | 31.3815/0.9039 | 30.6594/0.9009 | 31.9835/0.9121
BSD100 | 0.50 | 33.7145/0.9371 | 36.0208/0.9582 | 36.7344/0.9658 | 36.7421/0.9656 | 36.4358/0.9668 | -/- | 37.1064/0.9704
General100 | 0.01 | 18.9989/0.4700 | 22.5268/0.6229 | 22.9131/0.6321 | 22.8558/0.6276 | -/- | 22.9139/0.6018 | 23.4813/0.6374
General100 | 0.04 | 23.7578/0.6549 | 27.6234/0.7865 | 27.4456/0.7846 | 27.9241/0.7969 | 27.6451/0.7947 | 27.2431/0.7888 | 28.2144/0.8104
General100 | 0.10 | 28.5443/0.8104 | 32.0279/0.8863 | 31.5631/0.8838 | 32.4102/0.8968 | 31.7109/0.8964 | 30.8719/0.8895 | 32.7478/0.9093
General100 | 0.25 | 34.3164/0.9250 | 37.1454/0.9530 | 36.9876/0.3552 | 37.5467/0.9598 | 37.2737/0.9614 | 35.7811/0.9568 | 38.2174/0.9667
General100 | 0.50 | 39.9733/0.9740 | 42.5183/0.9835 | 42.8420/0.9857 | 43.2621/0.9869 | 42.9651/0.9874 | -/- | 43.8010/0.9891
Avg. | 0.01 | 19.1550/0.4424 | 22.2975/0.5728 | 22.6792/0.5835 | 22.5305/0.5767 | -/- | 22.6566/0.5584 | 23.1962/0.5873
Avg. | 0.04 | 23.1151/0.6043 | 26.4806/0.7270 | 26.4350/0.7295 | 26.7659/0.7389 | 26.5178/0.7390 | 26.2264/0.7347 | 26.9770/0.7546
Avg. | 0.10 | 26.9969/0.7542 | 30.0208/0.8354 | 29.8162/0.8373 | 30.3869/0.8469 | 29.8722/0.8507 | 29.2344/0.8455 | 30.6326/0.8635
Avg. | 0.25 | 31.9184/0.8869 | 34.4534/0.9234 | 34.5241/0.6603 | 34.9257/0.9323 | 34.6135/0.9354 | 33.4952/0.9318 | 35.3881/0.9421
Avg. | 0.50 | 37.1189/0.9573 | 39.5664/0.9720 | 40.1050/0.9767 | 40.3492/0.9772 | 40.0215/0.9780 | -/- | 40.7771/0.9806
Table 2. Comparison of parameters, model size, and computational time.
Metric | ISTA-Net+ | OPINE-Net+ | AMP-Net-BM | DGU-Net+ | TransCS | TCS-Net | TCR-Net
Time (s) | 0.0071 | 0.0088 | 0.0383 | 0.0278 | 0.0360 | 0.0212 | 0.0128
Parameters (M) | 0.34 | 0.62 | 0.58 | 6.92 | 1.44 | 0.52 | 0.37
Size (MB) | 1.4 | 2.5 | 2.4 | 9.6 | 21.4 | 2.1 | 4.7
Table 3. Average PSNR (dB) performance comparisons on datasets at a CS ratio of 0.25.
Datasets | TCR-Net-(7, 7, 1) | TCR-Net-(3, 5, 1) | TCR-Net
Set5 | 37.5737 | 37.5016 | 37.6505
McM18 | 37.8523 | 37.7893 | 37.9561
BSD100 | 31.9179 | 31.8828 | 31.9835
General100 | 38.0839 | 38.0262 | 38.2174
Avg. | 35.2887 | 35.2404 | 35.3881
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
