Article

Hybrid Transformer and Convolution for Image Compressed Sensing

by Ruili Nan, Guiling Sun *, Bowen Zheng and Pengchen Zhang
College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3496; https://doi.org/10.3390/electronics13173496
Submission received: 23 July 2024 / Revised: 9 August 2024 / Accepted: 2 September 2024 / Published: 3 September 2024

Abstract: In recent years, deep unfolding networks (DUNs) have received widespread attention in the field of compressed sensing (CS) reconstruction due to their good interpretability and strong mapping capabilities. However, existing DUNs often improve reconstruction quality at the expense of a large number of parameters, and they suffer from information loss during long-distance feature transmission. To address these problems, we propose an unfolding architecture that mixes Transformer and large kernel convolution to achieve sparse sampling and reconstruction of natural images, namely, a reconstruction network based on Transformer and convolution (TCR-Net). The Transformer framework has the inherent ability to capture global context through its self-attention mechanism, which effectively addresses the challenge of long-range feature dependencies. TCR-Net is an end-to-end two-stage architecture. First, a data-driven pre-trained encoder completes the sparse representation and basic feature extraction of image information. Second, a new attention mechanism is introduced to replace the self-attention in Transformer, and an optimization-inspired hybrid Transformer and convolution module is designed; its iterative process yields the unfolding framework, which approximates the original image stage by stage. Experimental results show that TCR-Net outperforms existing state-of-the-art CS methods while maintaining fast computational speed. Specifically, at a CS ratio of 0.10, the average PSNR on the test sets used in this paper improves by at least 0.8%, the average SSIM improves by at least 1.5%, and the processing speed exceeds 70 FPS. These quantitative results show that our method achieves high computational efficiency while ensuring high-quality image restoration.

1. Introduction

Compressed sensing (CS) is a new paradigm for information acquisition: a sparse signal $x \in \mathbb{R}^N$ can be reconstructed with high probability from measurements whose dimension is much smaller than the original signal dimension N [1]. CS research mainly focuses on two aspects: (1) designing efficient sampling matrices; (2) building high-quality reconstruction solvers to recover high-dimensional original signals from low-dimensional measurements. The focus of this paper is on the latter. The applications of CS technology include but are not limited to wireless sensor networks [2], medical imaging [3], single-pixel imaging [4], etc., because it greatly reduces the transmission bandwidth and storage demands of information acquisition while still allowing high-probability recovery. Mathematically, in the sampling stage, the image $x \in \mathbb{R}^N$ is rapidly sampled to obtain a linear random measurement $y = \Phi x \in \mathbb{R}^M$, where $\Phi \in \mathbb{R}^{M \times N}$ is the measurement matrix, $M \ll N$, and the sampling rate is $M/N$. The reconstruction stage restores the original image x from the low-dimensional measurement y. This inverse problem is clearly underdetermined, so in theory it has infinitely many solutions. To obtain reliable reconstructions, traditional CS methods usually minimize an energy function:
$$\arg\min_{x}\left(\frac{1}{2}\|\Phi x - y\|_2^2 + \alpha F(x)\right). \qquad (1)$$
In Equation (1), $\frac{1}{2}\|\Phi x - y\|_2^2$ is the data fidelity term, which measures the consistency between the measurements of the reconstructed image and the observed measurements y, and $\alpha F(x)$ is the prior term with regularization parameter $\alpha$. Because the inverse problem is underdetermined, traditional CS methods typically choose the prior term as a sparsity operator under some predefined transform basis, such as the wavelet transform or the discrete cosine transform (DCT) [5,6]. In most cases these methods converge reliably and are supported by theoretical analysis, but they commonly suffer from high computational complexity and low adaptability [7]. In recent years, owing to the powerful learning ability of deep neural networks, a series of deep-network-based image CS methods have been proposed [8,9]. These methods relax the sparsity assumptions on the original image and jointly optimize the sampling matrix and the nonlinear recovery operator [10,11], so that the two can be learned and coordinated through end-to-end training, capture the structural and textural features of the image more effectively, and greatly improve the efficiency and quality of image CS reconstruction. Among them, deep unfolding networks (DUNs) have received widespread attention due to their good interpretability and strong mapping capabilities.
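For reference, before turning to deep unfolding, the classical formulation in Equation (1) can be made concrete with a minimal NumPy sketch (not the method proposed in this paper): a Gaussian random matrix stands in for $\Phi$, and ISTA with soft-thresholding under a DCT sparsity prior stands in for the predefined transform bases discussed above.

```python
# Minimal sketch of Equation (1): Gaussian random sampling of a vectorized block,
# then ISTA with DCT-domain soft-thresholding as a simple hand-crafted prior F(x).
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
N, ratio = 1024, 0.10                                  # 32x32 block, CS ratio M/N = 0.10
M = int(ratio * N)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)         # Gaussian random measurement matrix

x = rng.standard_normal(N)                             # stand-in for a vectorized image block
y = Phi @ x                                            # low-dimensional measurements y = Phi x

alpha = 0.01
step = 1.0 / np.linalg.norm(Phi, 2) ** 2               # step size below 1/L for the fidelity term
x_hat = np.zeros(N)
for _ in range(200):
    r = x_hat - step * Phi.T @ (Phi @ x_hat - y)       # gradient step on the data fidelity term
    c = dct(r, norm='ortho')                           # move to the sparsifying (DCT) domain
    c = np.sign(c) * np.maximum(np.abs(c) - step * alpha, 0.0)   # soft-thresholding
    x_hat = idct(c, norm='ortho')                      # back to the image domain
```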
However, existing DUNs are often constrained by their model structure and are prone to losing feature information during iteration. Moreover, beyond the local features captured by a CNN, the global positional information of an image is also important, and it is difficult to learn this global information comprehensively with a plain CNN: stacked convolutional layers lead to over-parameterization and redundant filters. These issues naturally constrain the effective receptive field and can limit reconstruction performance [12,13].
This paper aims to use the inherent ability of the Transformer framework to capture global context to effectively solve the above problems. As the core module of Transformer, self-attention [14] was originally designed for 1D natural language processing (NLP) tasks and treats 2D images as 1D sequences, which destroys the key 2D structure of the image. This raises the question of how to design CS reconstruction models suitable for images to meet the requirements of global correlation modeling.
To solve this problem, this paper proposes an optimization-inspired hybrid Transformer and convolution module (TC) as the iterative unit and establishes a TC-based image CS unfolding framework, shown in Figure 1, to achieve joint optimization of image sparse sampling and reconstruction, namely, a reconstruction network based on Transformer and convolution (TCR-Net). In the TC module, to reduce the feature loss caused by the inherent structure of the unfolded network, an information transmission path is built between adjacent stages; meanwhile, a new dual-channel large kernel attention (Dual-LKA) is designed to replace the original self-attention in Transformer. Dual-LKA absorbs the advantages of convolution and self-attention, including local structural information, long-range dependence, and adaptability, while avoiding self-attention's neglect of channel-wise adaptability. Overall, our TCR-Net has good interpretability. Moreover, it alleviates the common problems of information loss and incomplete global feature acquisition in DUNs, preserving information more completely, which is conducive to more accurate and faster image CS reconstruction.
In summary, our main contributions are summarized as follows:
  • Combining Transformer and large kernel convolution, an optimization-inspired TC module is designed and its iterative process is used to construct a novel image CS unfolding framework, TCR-Net, which realizes the joint optimization of image sparse sampling and reconstruction.
  • In order to extract feature information more completely, an information transfer path is built between neighboring TC modules. Meanwhile, a new dual-channel large kernel attention mechanism (Dual-LKA) is proposed to process contextual information efficiently while retaining local descriptions; it integrates the advantages of convolution and self-attention while avoiding their drawbacks, making TCR-Net better suited to CS reconstruction of images.
  • Extensive experiments demonstrate that our proposed TCR-Net outperforms existing state-of-the-art CS methods while maintaining fast computational speed.
This paper is organized as follows: Section 1 introduces the research background and existing work on image CS and then summarizes our main contributions. Section 2 reviews related work on CS. Section 3 elaborates on the proposed TCR-Net. Section 4 presents comparative experiments between TCR-Net and existing advanced algorithms and analyzes and discusses the results. Section 5 concludes the paper and outlines future work.

2. Related Work

2.1. Deep Unfolding Network

The idea of DUNs is to cascade traditional iterative optimization algorithms through neural networks. DUNs have good interpretability on the training data pairs $\{(y_i, x_i)\}_{i=1}^{N_a}$ and are usually formulated as a bi-level optimization problem in the CS setting:
$$\min_{\Theta}\sum_{i=1}^{N_a}\mathcal{L}(\hat{x}_i, x_i), \quad \text{s.t.}\;\; \hat{x}_i = \arg\min_{x}\left(\frac{1}{2}\|\Phi x - y_i\|_2^2 + \alpha F(x)\right). \qquad (2)$$
In recent years, some optimization methods, such as the iterative shrinkage-thresholding algorithm (ISTA) [15] and approximate message passing (AMP) [16], have been continuously developed into unfolding networks [17,18]. However, from the perspective of the network structure, the inherent transmission path of DUNs is prone to losing feature information, each stage is strongly affected by its adjacent stages, and the ability to capture global context information is poor.

2.2. Transformer

Inspired by the success of Transformers [19] in NLP, researchers began extending the Transformer structure to various computer vision tasks [20], such as segmentation [21], object detection [22], and image restoration [23]. Despite this success, its core module, self-attention, still has shortcomings [24]. Besides destroying the critical 2D structure of the image as mentioned above, processing high-resolution images is difficult because of the quadratic computation and memory overhead. In addition, self-attention adapts only along the spatial dimension and ignores adaptation along the channel dimension, which is also important for visual tasks [25]. Therefore, the practical application of Transformer in visual tasks requires further exploration. Since self-attention is replaceable in visual tasks [26], we design an attention mechanism better suited to fine-grained image restoration, one that accounts for local structural information and long-range dependencies while keeping the network adaptive in both the spatial and channel dimensions.

3. Proposed Method

In this section, we will describe the proposed TCR-Net in detail.

3.1. Overall Architecture

Considering simplicity and interpretability, we follow [27] and directly expand the traditional proximal gradient descent (PGD) [28] to solve Equation (1) and express it as an iterative function, namely, Equation (3):
$$\hat{x}^{(k)} = \arg\min_{x}\left(\frac{1}{2}\left\|x - \left(\hat{x}^{(k-1)} - \lambda^{(k)}\nabla g\left(\hat{x}^{(k-1)}\right)\right)\right\|_2^2 + \alpha F(x)\right), \qquad (3)$$
where $\hat{x}^{(k)}$ denotes the output of the k-th iteration, $g(\cdot)$ denotes the data fidelity term in Equation (1), and $\nabla$ is the gradient operator weighted by the step size $\lambda^{(k)}$.
In practice, inspired by OPINE-Net [17], a DUN unfolded from ISTA, Equation (3) can be split into two sub-problems: gradient descent (GD, Equation (4)) and proximal mapping (PM, Equation (5)), where the PM is in effect a CNN-based denoiser. Different from OPINE-Net, we use a PM with a generalized design: rather than the hand-crafted $\ell_1$ prior of OPINE-Net (where the prior term is the $\ell_1$ norm, whose proximal operator is $\mathrm{prox}_{\alpha,F}(r^{(k)}) = \mathrm{sign}(r^{(k)})\max(0, |r^{(k)}| - \alpha)$ and which biases training toward selecting relatively few features), our PM has wider representational capability and is easily extended to other degradation tasks.
$$r^{(k)} = \hat{x}^{(k-1)} - \lambda^{(k)}\Phi^{T}\left(\Phi\hat{x}^{(k-1)} - y\right), \qquad (4)$$
$$\hat{x}^{(k)} = \mathrm{prox}_{\alpha,F}\left(r^{(k)}\right) = \arg\min_{x}\left(\frac{1}{2}\left\|x - r^{(k)}\right\|_2^2 + \alpha F(x)\right). \qquad (5)$$
The variables $r^{(k)}$ and $\hat{x}^{(k)}$ are updated iteratively until convergence, where $r^{(k)}$ denotes an intermediate variable in the k-th iteration.
Therefore, the iterative process of TCR-Net can be briefly expressed as Equation (6), with $k \in \{1, 2, \ldots, N_s\}$, where $N_s$ denotes the number of TC modules, i.e., the number of network stages.
$$\hat{x}^{(k)} = H_{PM}^{(k)}\left(\hat{x}^{(k-1)} - \lambda^{(k)}\Phi^{T}\left(\Phi\hat{x}^{(k-1)} - y\right)\right). \qquad (6)$$
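The two sub-problems above translate directly into one stage of an unfolded network. The following PyTorch sketch is only illustrative: the learnable step size and a small residual CNN stand in for $\lambda^{(k)}$ and $H_{PM}^{(k)}$, whereas the actual PM used in TCR-Net is the TC module described in Section 3.2.

```python
# One unfolded PGD stage, Equations (4)-(6): gradient descent on the fidelity term
# followed by a learned proximal mapping (here a placeholder residual CNN).
import torch
import torch.nn as nn

class PGDStage(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))          # learnable step size lambda^(k)
        self.prox = nn.Sequential(                            # placeholder for H_PM^(k)
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x_hat, y, sample, transpose):
        # sample(.)   : x -> Phi x      (block-wise measurement operator)
        # transpose(.): y -> Phi^T y    (its adjoint, back to image space)
        r = x_hat - self.step * transpose(sample(x_hat) - y)  # Equation (4), GD step
        return r + self.prox(r)                                # Equation (5), residual PM
```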

3.2. Architecture Design of TCR-Net

In this section, we will elaborate on the structural design and related theories of TCR-Net. As shown in Figure 1, the network is divided into two stages. The first is adaptive sparse sampling of images to obtain measurements; the second is inverse problem solving to achieve end-to-end inverse mapping of measurements to original images, including initial reconstruction (IR) and deep reconstruction (DR).

3.2.1. Sampling and Initial Reconstruction

In order to obtain better image reconstruction results, a data-driven pre-trained encoder is used to complete the sparse representation and basic feature extraction of image information. Specifically, we first divide the image X into non-overlapping blocks $\{x_i\}$ of size $C \times B \times B$, where B denotes the block size and C the number of image channels; in this paper, C = 1 by default. To avoid destroying the 2D structure of the image, the sampling is implemented with bias-free convolution. Assuming the measurement matrix is $\Phi \in \mathbb{R}^{M \times N}$ ($M \leq N$), M convolution kernels of size $1 \times \sqrt{N} \times \sqrt{N}$ are convolved with $\{x_i\}$ (abbreviated as x). This achieves sparse sampling of the image and yields the measurements $\{y_i\}$ (abbreviated as y). Sampling proceeds block by block with the convolution stride set to P, i.e., $P = \sqrt{N}$; in general, B is an integer multiple of P. Denoting the sampling module as $S(\cdot)$, we have $y = S(x)$.
$\Phi^{T}$ is used to perform the IR of the image so that no additional parameters need to be introduced. That is, $\Phi$ is transposed into N convolution kernels of size $M \times 1 \times 1$ and convolved with y with a stride of 1 to obtain the initial reconstruction $\hat{x}^{(0)}$. Denoting the IR module as $I(\cdot)$, we have $\hat{x}^{(0)} = I(S(x))$.
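The sampling and initial reconstruction described above can be sketched with standard PyTorch layers as follows; this is an illustrative rendering under the settings of Section 4.1.2 (P = 32, N = P² = 1024), not the authors' released implementation.

```python
# Bias-free PxP convolution with stride P as the sampling operator S(.), and a 1x1
# convolution with N output channels followed by PixelShuffle(P) as Phi^T for IR.
import math
import torch
import torch.nn as nn

class SamplingAndIR(nn.Module):
    def __init__(self, gamma=0.10, P=32):
        super().__init__()
        N = P * P
        M = math.ceil(gamma * N)
        self.sample = nn.Conv2d(1, M, kernel_size=P, stride=P, bias=False)  # y = S(x)
        self.init_rec = nn.Conv2d(M, N, kernel_size=1, bias=False)          # Phi^T applied to y
        self.fold = nn.PixelShuffle(P)                                       # fold N channels back to a block

    def forward(self, x):                        # x: (batch, 1, H, W), H and W multiples of P
        y = self.sample(x)                       # measurements, shape (batch, M, H/P, W/P)
        x0 = self.fold(self.init_rec(y))         # initial reconstruction x^(0), shape (batch, 1, H, W)
        return y, x0
```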

3.2.2. Deep Reconstruction

The DR stage iterates the TC module $N_s$ times. The structure of the k-th TC module is shown in Figure 2. Each TC module contains GD and PM. The k-th TC module takes $\hat{x}^{(k-1)}$ and $z^{(k-1)}$ as inputs and outputs $\hat{x}^{(k)}$ and $z^{(k)}$; the two tensors are updated iteratively. Constrained by the structure of DUNs, the input and output of each stage are single-channel images, and the $3 \times 3$ convolution in PM that converts multi-channel features back to a single channel is a lossy conversion, which leads to the loss of image details. Therefore, by passing the output $z^{(k)}$ of the feed-forward neural network directly to the next stage, an information transmission path is built between adjacent stages, which effectively avoids the information loss caused by the channel-shrinking conversion.
The symmetry of ξ ( · ) and ζ ( · ) helps the model better understand and utilize the input symmetry information. Since the sparsity or structure of information in compressed sensing is often related to symmetry, this design helps information transfer and feature extraction.
The attention module $A(\cdot)$ is the core of the TC module design. Transformer uses the self-attention mechanism to capture long-range dependencies, and it plays an increasingly important role in computer vision [29,30]. However, as described in Section 2, it has obvious shortcomings that cannot be ignored. The attention mechanism can be viewed as an adaptive selection process, which selects discriminative features from the input and automatically suppresses noisy responses; its key steps are generating an attention map that represents the importance of different parts and learning the relationships between different features. Large kernel convolutions [31,32] can also be used to build such correlations and generate attention maps, but this approach has its own obvious drawback: large kernels bring considerable computational overhead and parameters. In order to overcome these shortcomings while retaining the advantages of self-attention and large kernel convolution, we utilize decomposable large kernel convolution to capture long-range relationships.
Guo et al. [33] showed through detailed experiments that a large kernel convolution can be effectively decomposed into a combination of three convolutions: a depth-wise convolution, a depth-wise dilation convolution, and a $1 \times 1$ convolution. At the same time, an attention mechanism is involved in the decomposition; in other words, decomposable large kernel convolution provides a receptive field similar to that of the self-attention mechanism. Through this decomposition, we can capture long-range relationships with small computational cost and few parameters, estimate the importance of individual positions, and generate the corresponding attention maps.
Specifically, we decompose a $K \times K$ convolution into a $(2d-1) \times (2d-1)$ depth-wise convolution, a $\lceil K/d \rceil \times \lceil K/d \rceil$ depth-wise dilation convolution with dilation d, and a $1 \times 1$ convolution, where K is the size of the convolution kernel. The number of parameters $P(K, d)$ and the number of floating-point operations (FLOPs) $F(K, d)$ for an input of spatial size $H \times W$ with C channels are then:
$$P(K, d) = \left((2d-1)^2 + \lceil K/d \rceil^2 + C + 3\right) \cdot C, \qquad (7)$$
$$F(K, d) = P(K, d) \times H \times W. \qquad (8)$$
From Equations (7) and (8), $P(K, d)$ grows quadratically with K and C, and $F(K, d)$ grows linearly with $P(K, d)$ and the image size. Once the reconstruction target is fixed, H, W, and C are fixed, so the computational cost can be reduced by reducing $P(K, d)$. Therefore, to minimize $P(K, d)$ for a fixed kernel size K and reduce the network's computational cost, the derivative of Equation (7) with respect to d is set to zero, with $\lceil K/d \rceil$ approximated by $K/d$:
$$\frac{\mathrm{d}P(K, d)}{\mathrm{d}d} = \left(8d - \frac{2K^2}{d^3} - 4\right) \cdot C \overset{!}{=} 0. \qquad (9)$$
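Since C cancels in Equation (9), the optimal d depends only on K and can be checked numerically; the small script below reproduces the values used in the next paragraph.

```python
# Numerical root of dP(K, d)/dd = 0, i.e. 8d - 4 - 2K^2/d^3 = 0, for the two kernel sizes.
from scipy.optimize import brentq

for K in (9, 27):
    d_opt = brentq(lambda d: 8 * d - 4 - 2 * K**2 / d**3, 1.0, 10.0)
    print(f"K = {K:2d}: optimal d = {d_opt:.2f}")
# K =  9: optimal d = 2.26  (rounded to d = 2)
# K = 27: optimal d = 3.81  (rounded to d = 4)
```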
In order to give $A(\cdot)$ richer multi-scale information, a multi-scale mechanism is introduced and a Dual-LKA is designed; its structure is shown in Figure 3. After the input features pass through a $1 \times 1$ convolution and a GELU activation function, two large kernel decompositions of different sizes are applied, with K = 9 and K = 27, respectively. According to the solution of Equation (9), $d \approx 2.3$ when K = 9 and $d \approx 3.8$ when K = 27, so d is set to 2 and 4, respectively. The corresponding kernel parameters are therefore (3, 5, 1) and (7, 7, 1).
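An illustrative PyTorch sketch of the Dual-LKA attention $A(\cdot)$ is given below: each branch is a decomposed large kernel (K = 9 with d = 2, K = 27 with d = 4) built from a depth-wise convolution, a depth-wise dilation convolution, and a $1 \times 1$ convolution, and the combined response re-weights the input features. The additive fusion of the two branches is our assumption; the exact layout is defined by Figure 3.

```python
import torch
import torch.nn as nn

class LKABranch(nn.Module):
    """One decomposed large-kernel branch: depth-wise conv + dilated depth-wise conv + 1x1 conv."""
    def __init__(self, channels, dw_k, dil_k, dilation):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, dw_k, padding=dw_k // 2, groups=channels)
        self.dw_dil = nn.Conv2d(channels, channels, dil_k, dilation=dilation,
                                padding=(dil_k // 2) * dilation, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw(self.dw_dil(self.dw(x)))

class DualLKA(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, channels, 1)
        self.act = nn.GELU()
        self.branch_k9 = LKABranch(channels, dw_k=3, dil_k=5, dilation=2)    # (3, 5, 1), K = 9
        self.branch_k27 = LKABranch(channels, dw_k=7, dil_k=7, dilation=4)   # (7, 7, 1), K = 27
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        u = self.act(self.proj_in(x))
        attn = self.branch_k9(u) + self.branch_k27(u)    # assumed additive fusion of the two scales
        return self.proj_out(attn * u)                   # attention map re-weights the features
```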
Expressing the TC module as D ( · ) and the convolution operation as C ( · ) , the iterative process of the DR network can be expressed as:
$$\hat{x}^{(k)}, z^{(k)} = D\left(\hat{x}^{(k-1)}, z^{(k-1)}\right), \qquad (10)$$
where $z^{(0)} = C(\hat{x}^{(0)})$. After $N_s$ iterations, the final reconstructed image $\hat{x}^{(N_s)}$ is obtained.
In summary, the complete implementation process of TCR-Net is summarized in Algorithm 1, where $\|$ denotes concatenation along the channel dimension and $C_{k \times k}(\cdot)$ denotes a $k \times k$ convolution.
Algorithm 1: TCR-Net for Image Compressed Sensing.
  • Input: x; initialize the iteration index k = 1 and its ceiling $N_s$; learnable $\Phi$ and $\lambda^{(k)}$
  • Output: $\hat{x}^{(N_s)}$
  • Adaptive Sampling: $y = S(x)$
  • Reconstruction:
  • Initialization: $\hat{x}^{(0)} = I(S(x))$, $z^{(0)} = C_{3 \times 3}(\hat{x}^{(0)})$
  • for $k = 1, \ldots, N_s$ do
  •      $r^{(k)} = \hat{x}^{(k-1)} - \lambda^{(k)} \Phi^{T}(\Phi \hat{x}^{(k-1)} - y)$
  •      $b^{(k)} = \xi(C_{3 \times 3}(r^{(k)} \| z^{(k-1)})) + C_{3 \times 3}(r^{(k)} \| z^{(k-1)})$
  •      $c^{(k)} = A(\mathrm{LN}(b^{(k)})) + b^{(k)}$
  •      $z^{(k)} = \zeta(\mathrm{LN}(c^{(k)})) + c^{(k)}$
  •      $\hat{x}^{(k)} = C_{3 \times 3}(z^{(k)}) + r^{(k)}$
  • Return $\hat{x}^{(N_s)}$
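For illustration, the per-stage updates of Algorithm 1 can be rendered in PyTorch as follows. The sub-modules $\xi(\cdot)$, $A(\cdot)$ (e.g., the Dual-LKA sketched above), and $\zeta(\cdot)$ are passed in, a channel-wise GroupNorm stands in for $\mathrm{LN}(\cdot)$, and the layer shapes follow Section 3.3; this is our reading of Algorithm 1 rather than the authors' code.

```python
import torch
import torch.nn as nn

class TCModule(nn.Module):
    def __init__(self, xi, attn, zeta, channels=32):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))                   # lambda^(k)
        self.fuse = nn.Conv2d(channels + 1, channels, 3, padding=1)   # C_3x3 applied to r || z
        self.out = nn.Conv2d(channels, 1, 3, padding=1)               # C_3x3 back to one channel
        self.norm1 = nn.GroupNorm(1, channels)                        # stand-in for LN(.)
        self.norm2 = nn.GroupNorm(1, channels)
        self.xi, self.attn, self.zeta = xi, attn, zeta

    def forward(self, x_hat, z, y, sample, transpose):
        r = x_hat - self.step * transpose(sample(x_hat) - y)   # gradient descent step
        f = self.fuse(torch.cat([r, z], dim=1))                # C_3x3(r || z)
        b = self.xi(f) + f                                      # b^(k)
        c = self.attn(self.norm1(b)) + b                        # c^(k)
        z_next = self.zeta(self.norm2(c)) + c                   # z^(k), forwarded to the next stage
        x_next = self.out(z_next) + r                           # x_hat^(k)
        return x_next, z_next
```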

3.3. Loss Function and Network Parameters

Given the training dataset $\{x_i\}$ and the CS ratio $\gamma$, each $x_i$ is first fed into TCR-Net to complete adaptive sampling and reconstruction in sequence and to output the final restored image $\hat{x}_i^{(N_s)}$; the MSE between $x_i$ and $\hat{x}_i^{(N_s)}$ is then used as the loss ($N_s$ is the number of DR phases):
$$\mathcal{L} = \left\|\hat{x}_i^{(N_s)} - x_i\right\|_2^2. \qquad (11)$$
The TCR-Net proposed in this article is an end-to-end mapping network that can be learned throughout the entire process. All involved parameters (such as measurement matrices, nonlinear transformations, etc.) are learned using end-to-end backpropagation, which has the advantage of fast and accurate reconstruction performance and explicit interpretability.
Specifically, the learnable parameter set $\Theta$ in TCR-Net includes $\Phi$, $\lambda$, $\xi(\cdot)$, $A(\cdot)$, $\zeta(\cdot)$, and several convolution modules $C_{k \times k}(\cdot)$ at different scales, i.e., $\Theta = \{\Phi, \lambda, \xi(\cdot), A(\cdot), \zeta(\cdot), C_{k \times k}(\cdot)\}$. Note that the same parameters are shared by all TC modules except $\lambda$.
In TC, the first $C_{3 \times 3}(\cdot)$ has C + 1 input channels and C output channels, and the last $C_{3 \times 3}(\cdot)$ has C input channels and one output channel. After $\xi(\cdot)$, $A(\cdot)$, and $\zeta(\cdot)$, the number of channels remains C. The $C_{3 \times 3}(\cdot)$ in $z^{(0)} = C_{3 \times 3}(\hat{x}^{(0)})$ has one input channel and C output channels.

4. Experiment

4.1. Experimental Settings

4.1.1. Datasets and Performance Measures

Our training dataset uses the train (200 images) and test (200 images) splits of BSD500 [34], and the validation dataset is Set11 [35]. Each training image is randomly cropped into 200 sub-images of size 96 × 96, so the training set contains 80,000 sub-images in total. During testing, the optimal model is selected and evaluated on four widely used benchmark datasets: Set5 [36], McM18 [37], BSD100 [34], and General100 [38].
In order to measure the performance of each algorithm, two commonly used metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are used to comprehensively evaluate the quality of image reconstruction. PSNR represents the peak signal-to-noise ratio between the reconstructed image and the original image, which is used to quantify the image quality. SSIM evaluates the structural similarity between the reconstructed image and the original image, reflecting the visual quality of the image. We also show parameters, model size, and average computational time of each method to measure their performance.
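For completeness, PSNR is computed as below for 8-bit grayscale images, while SSIM is typically taken from a library implementation such as scikit-image rather than re-implemented; the function names here are illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref, rec, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)          # peak signal-to-noise ratio in dB

def ssim(ref, rec):
    return structural_similarity(ref, rec, data_range=255)   # evaluated on the Y channel
```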

4.1.2. Implementation Details

Adaptive sampling and reconstruction are performed at different CS ratios $\gamma = \{0.01, 0.04, 0.10, 0.25, 0.50\}$, each with a corresponding adaptive measurement matrix $\Phi \in \mathbb{R}^{M \times N}$. In TCR-Net, B = 96 and P = 32, so N = 1024 and M = $\gamma \cdot N$ in $\Phi$. Training was conducted on an RTX 4090 (24 GB) GPU with Python and PyTorch 1.11.0. All tests and ablation studies were conducted on an Intel Xeon(R) W-2145 CPU with an NVIDIA Quadro RTX 4000 GPU. TCR-Net is trained for 100 epochs with a batch size of 16, and the number of feature-map channels C is 32. We use the Adam [39] optimizer with an initial learning rate of 4 × 10⁻⁵, adjusted to 5 × 10⁻⁵ over the 100 epochs using the cosine annealing strategy [40], with 3 warm-up epochs. Note that color images are processed in the YCbCr space and evaluated on the Y channel. Considering device performance, we set the number of stages $N_s$ to 7 for a better trade-off between model performance and complexity.
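A sketch of this training setup is given below, assuming a `model` that maps an image batch to its reconstruction and a `loader` of 96 × 96 sub-image batches; the learning-rate values are taken verbatim from the text, and since the handling of the 3 warm-up epochs is not specified there, the linear warm-up shown is purely illustrative.

```python
import torch

def train(model, loader, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=5e-5)
    for epoch in range(epochs):
        warm = min(1.0, (epoch + 1) / 3)                    # illustrative linear warm-up over 3 epochs
        for group in optimizer.param_groups:
            group['lr'] = scheduler.get_last_lr()[0] * warm
        for x in loader:                                    # x: (16, 1, 96, 96) training sub-images
            optimizer.zero_grad()
            x_rec = model(x)                                # adaptive sampling + reconstruction
            loss = torch.mean((x_rec - x) ** 2)             # MSE loss of Equation (11)
            loss.backward()
            optimizer.step()
        scheduler.step()                                    # cosine annealing of the learning rate
```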

4.2. Comparisons with State-of-the-Art Methods

In order to evaluate the performance of the proposed TCR-Net, it is compared with existing representative CS methods (including ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], TCS-Net [44]) in terms of reconstruction quality and algorithm complexity. These comparison algorithms all belong to DUNs. ISTA-Net+ uses Gaussian random matrix (GRM), and the other algorithms all use learnable measurement matrices. TransCS and TCS-Net both introduce the Transformer structure.
Table 1 shows the experimental comparison on these datasets at multiple CS ratios. It can be observed from Table 1 that, in all cases, our TCR-Net outperforms all other competing methods in terms of PSNR and SSIM. Avg. in Table 1 denotes the average reconstruction quality over all datasets at a given CS ratio for each algorithm, computed as follows:
$$p = \sum_{i=1}^{D} p_i \cdot n_i \Big/ \sum_{i=1}^{D} n_i, \qquad s = \sum_{i=1}^{D} s_i \cdot n_i \Big/ \sum_{i=1}^{D} n_i. \qquad (12)$$
In Equation (12), $p_i$ and $s_i$ represent the PSNR and SSIM of the i-th dataset, respectively, $n_i$ is the number of images in the i-th dataset, and D is the number of datasets (D = 4 in our paper). Therefore, p and s represent the average PSNR and SSIM over the four datasets at a given CS ratio for a given algorithm. For example, when the CS ratio is 0.10, TCR-Net outperforms ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], and TCS-Net [44] by 3.6357 dB, 0.6118 dB, 0.8164 dB, 0.2457 dB, 0.7604 dB, and 1.3982 dB in terms of PSNR, respectively. Figure 4 further shows the visual comparison at a CS ratio of 0.10; our TCR-Net produces clearer and more accurate reconstructions than the others.
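As a concrete check of Equation (12), the Avg. PSNR of TCR-Net at a CS ratio of 0.10 can be recovered from its per-dataset values in Table 1, using the dataset sizes (Set5: 5, McM18: 18, BSD100: 100, General100: 100 images) as the weights $n_i$:

```python
sizes = [5, 18, 100, 100]                             # n_i for Set5, McM18, BSD100, General100
psnr_per_set = [33.1097, 32.5369, 28.0507, 32.7478]   # TCR-Net PSNR at CS ratio 0.10 (Table 1)
avg = sum(p * n for p, n in zip(psnr_per_set, sizes)) / sum(sizes)
print(f"{avg:.4f}")                                   # 30.6326, the Avg. entry for TCR-Net
```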
For algorithm complexity, Figure 5 shows the real image reconstruction performance of each algorithm under different parameter capacities on Set5 [36]. Meanwhile, Table 2 shows the comparison of parameters, model sizes, and computational time for reconstructing a 256 × 256 image of different CS methods at a CS ratio of 0.10. Time includes the sampling and reconstruction process.
Moreover, TCR-Net takes an average of 14.3 milliseconds (ms) to reconstruct a 1920 × 1080 single-channel image on the GPU; that is, its reconstruction speed is about 70 FPS, which exceeds the 60 FPS frame rate of high-definition video. This makes it well suited to processing tasks with limited resources and tight time budgets. In summary, TCR-Net is an effective CS image restoration method with high reconstruction quality and good stability.

4.3. Ablation Studies and Discussions

4.3.1. Measurement Matrix

TCR-Net utilizes a learnable measurement matrix to achieve adaptive sampling; specifically, a data-driven pre-trained encoder is used to complete the sparse representation of an image and the extraction of basic features. Figure 6 shows the results of image CS reconstruction achieved by using the learnable measurement matrix and GRM, respectively, at a CS ratio of 0.10, and we can find that at least a 1.67 dB gain is obtained by using the learnable measurement matrix compared to the GRM, and the image quality is greatly improved. Meanwhile, Figure 7 shows a simple visualization of the two measurement matrices in the frequency domain, where the learned measurement matrix can adaptively assist and balance the amount of low-frequency and high-frequency information retained in the sensed measurements for better image reconstruction.

4.3.2. Dual-LKA

In this subsection, in order to show that our Dual-LKA design is reasonable, comparative experiments between Dual-LKA and Single-LKA are conducted. The two branches of Dual-LKA use the parameters (7, 7, 1) and (3, 5, 1), so Single-LKA is set to (7, 7, 1) and (3, 5, 1), respectively. Table 3 reports the PSNR of image CS reconstruction on each dataset at a sampling rate of 0.25; the average PSNR is improved by 0.0994 dB and 0.1477 dB over the two Single-LKA variants, respectively. Figure 8 shows the visual comparison on baby_GT from Set5: the reconstructed image (around the eyelash roots) is clearer and more accurate when Dual-LKA is used. In summary, Dual-LKA genuinely improves the quality of image reconstruction.

4.3.3. Sensitivity to Noise

In practice, imaging models may be affected by noise, so to test the robustness of the designed TCR-Net, we first add Gaussian noise of different levels to the images of Set5 and BSD100. The noisy images are then fed to each model, and Set5 and BSD100 are sampled and recovered at a CS ratio of 0.25. Figure 9 plots the PSNR of all methods against the standard deviation of the added zero-mean Gaussian noise. It can be seen that our TCR-Net is robust to noise corruption.
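The noisy-input evaluation described above can be sketched as follows; the clamping range assumes images normalized to [0, 1], and the function and variable names are illustrative rather than taken from the released code.

```python
import torch

def evaluate_under_noise(model, image, sigma):
    noisy = image + sigma * torch.randn_like(image)     # zero-mean Gaussian noise, std = sigma
    noisy = noisy.clamp(0.0, 1.0)                       # assumes intensities normalized to [0, 1]
    with torch.no_grad():
        rec = model(noisy)                              # sample and reconstruct the noisy input at ratio 0.25
    mse = torch.mean((rec - image) ** 2)
    return 10.0 * torch.log10(1.0 / mse)                # PSNR against the clean reference
```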

5. Conclusions

This paper proposes TCR-Net, a fully end-to-end learnable image CS framework that exploits the interpretability and strong mapping capability of DUNs to achieve high-quality, low-latency image reconstruction. TCR-Net jointly optimizes adaptive sparse sampling and reconstruction of natural images. We design a Dual-LKA based on the Transformer structure and large kernel decomposition convolution, which processes contextual information effectively while focusing on local features. On this basis, an optimization-inspired TC module is constructed, and an information transmission path is built between adjacent TC modules, successfully reducing the feature loss caused by channel conversion. Experimental results show that, compared with other mainstream CS methods, TCR-Net delivers higher reconstruction performance and better perceptual quality. In the future, TCR-Net can be adapted and applied to MRI, satellite imaging, and video surveillance systems.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, writing—review and editing, R.N.; funding acquisition, project administration, resources, G.S.; writing—review and editing, validation, formal analysis, B.Z.; writing—review and editing, validation, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tianjin Natural Science Foundation grant number 21JCZDJC00340 and National Natural Science Foundation of China grant number 61771262, 61901233.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
  2. Tian, X.; Wei, G.; Wang, J. Target location method based on compressed sensing in hidden semi Markov model. Electronics 2022, 11, 1715. [Google Scholar] [CrossRef]
  3. Fei, T.; Feng, X. Learning sampling and reconstruction using Bregman iteration for CS-MRI. Electronics 2023, 12, 4657. [Google Scholar] [CrossRef]
  4. Abedi, M.; Sun, B.; Zheng, Z. Single-pixel compressive imaging based on random DoG filtering. Signal Process. 2021, 178, 107746. [Google Scholar] [CrossRef]
  5. Zhao, C.; Ma, S.; Zhang, J.; Xiong, R.; Gao, W. Video compressive sensing reconstruction via reweighted residual sparsity. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1182–1195. [Google Scholar] [CrossRef]
  6. Zhao, C.; Ma, S.; Gao, W. Image compressive-sensing recovery using structured laplacian sparsity in DCT domain and multi-hypothesis prediction. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 14–18 July 2014. [Google Scholar] [CrossRef]
  7. Zhao, C.; Zhang, J.; Ma, S.; Gao, W. Non-convex Lp Nuclear Norm based ADMM Framework for Compressed Sensing. In Proceedings of the 2016 Data Compression Conference (DCC), Snowbird, UT, USA, 30 March–1 April 2016; pp. 161–170. [Google Scholar] [CrossRef]
  8. Wang, B.; Lian, Y.; Xiong, X.; Zhou, H.; Liu, Z.; Das, M. Progressive feature reconstruction and fusion to accelerate MRI imaging: Exploring insights across low, mid, and high-order dimensions. Electronics 2023, 12, 4742. [Google Scholar] [CrossRef]
  9. Xie, Y.; Li, Q. A review of deep learning methods for compressed sensing image reconstruction and its medical applications. Electronics 2022, 11, 586. [Google Scholar] [CrossRef]
  10. Shi, W.; Jiang, F.; Liu, S.; Zhao, D. Scalable convolutional neural network for image compressed sensing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12282–12291. [Google Scholar] [CrossRef]
  11. You, D.; Zhang, J.; Xie, J.; Chen, B.; Ma, S. COAST: Controllable arbitrary-sampling network for compressive sensing. IEEE Trans. Image Process. 2021, 30, 6066–6080. [Google Scholar] [CrossRef]
  12. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4905–4913. [Google Scholar]
  13. Prakash, A.; Storer, J.; Florencio, D.; Zhang, C. RePr: Improved training of convolutional filters. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10658–10667. [Google Scholar] [CrossRef]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929v2. [Google Scholar]
  15. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627. [Google Scholar] [CrossRef]
  16. Donoho, D.L.; Maleki, A.; Montanari, A. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 2009, 106, 18914–18919. [Google Scholar] [CrossRef]
  17. Zhang, J.; Zhao, C.; Gao, W. Optimization-inspired compact deep compressive sensing. IEEE J. Sel. Top. Signal Process. 2020, 14, 765–774. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Liu, Y.; Liu, J.; Wen, F.; Zhu, C. AMP-Net: Denoising-based deep unfolding for compressive image sensing. IEEE Trans. Image Process. 2021, 30, 1487–1500. [Google Scholar] [CrossRef] [PubMed]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5999–6099. [Google Scholar]
  20. Zhang, J.; Huang, Y.; Wu, W.; Lyu, M.R. Transferable adversarial attacks on vision transformers with token gradient regularization. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16415–16424. [Google Scholar] [CrossRef]
  21. Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 17378–17389. [Google Scholar] [CrossRef]
  22. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  23. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped Transformer for image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17662–17672. [Google Scholar] [CrossRef]
  24. Zhang, J.; Huang, J.T.; Wang, W.; Li, Y.; Wu, W.; Wang, X.; Su, Y.; Lyu, M.R. Improving the transferability of adversarial samples by path-augmented method. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8173–8182. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  26. Liu, H.; Dai, Z.; So, D.R.; Le, Q.V. Pay attention to MLPs. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 9204–9215. [Google Scholar]
  27. Lefkimmiatis, S. Universal denoising networks: A novel CNN architecture for image denoising. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3204–3213. [Google Scholar] [CrossRef]
  28. Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]
  29. Xu, Y.; Wei, H.; Lin, M.; Deng, Y.; Sheng, K.; Zhang, M.; Tang, F.; Dong, W.; Huang, F.; Xu, C. Transformers in computational visual media: A survey. Comput. Vis. Media 2022, 8, 33–62. [Google Scholar] [CrossRef]
  30. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar] [CrossRef]
  31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  32. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  33. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  34. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 898–916. [Google Scholar] [CrossRef]
  35. Kulkarni, K.; Lohit, S.; Turaga, P.; Kerviche, R.; Ashok, A. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 449–458. [Google Scholar] [CrossRef]
  36. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012; p. 135. [Google Scholar] [CrossRef]
  37. Zhang, L.; Wu, X.; Buades, A.; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imag. 2011, 20, 023016. [Google Scholar] [CrossRef]
  38. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar] [CrossRef]
  39. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  40. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  41. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1828–1837. [Google Scholar] [CrossRef]
  42. Mou, C.; Wang, Q.; Zhang, J. Deep generalized unfolding networks for image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17378–17389. [Google Scholar] [CrossRef]
  43. Shen, M.; Gan, H.; Ning, C.; Hua, Y.; Zhang, T. TransCS: A Transformer-based hybrid architecture for image compressed sensing. IEEE Trans. Image Process. 2022, 31, 6991–7005. [Google Scholar] [CrossRef] [PubMed]
  44. Gan, H.; Shen, M.; Hua, Y.; Ma, C.; Zhang, T. From patch to pixel: A Transformer-based hierarchical framework for compressive image sensing. IEEE Trans. Comput. Imaging 2023, 9, 133–146. [Google Scholar] [CrossRef]
Figure 1. Illustration of our TCR-Net framework, which contains adaptive compressed sampling and inverse mapping. The inverse mapping consists of two steps: initial reconstruction (IR) and deep reconstruction (DR). In the block-based CS scheme, the original image X is divided into l non-overlapping B × B blocks $\{x_i\}$, which are sampled to obtain block measurements $\{y_i\}$; these are initialized by IR and then refined by DR to obtain the recovered estimate of X. $X^{(k-1)}$ and $Z^{(k-1)}$ are the inputs of the k-th iterative process.
Figure 2. Illustration of the k-th TC module, which contains a gradient descent (GD) module and a proximal mapping (PM) module. The PM module includes the feature extraction module $\xi(\cdot)$, the attention module $A(\cdot)$, and the feed-forward network module $\zeta(\cdot)$. d denotes depth-wise convolution, and k × k denotes a k × k convolution.
Figure 3. Illustration of Dual-LKA. d denotes depth-wise convolution, dd denotes depth-wise dilation convolution, and k × k denotes a k × k convolution. $A(\cdot)$ does not change the number of channels.
Figure 4. Visual comparison of all the competing CS algorithms at a CS ratio of 0.10.
Figure 5. Real image reconstruction performance (y-axis) of our TCR-Net and some recent methods (ISTA-Net+ [41], OPINE-Net+ [17], AMP-Net-BM [18], DGU-Net+ [42], TransCS [43], TCS-Net [44]) under different parameter capacities (x-axis) on Set5 [36].
Figure 6. Reconstruction performance (PSNR/SSIM) with different measurement matrices at a CS ratio of 0.10.
Figure 7. The visualizations of the measurement matrix at a CS ratio of 0.10 in the frequency domain.
Figure 8. Visual comparison on baby_GT.
Figure 9. Comparison of robustness to Gaussian noise.
Table 1. Average PSNR (dB) and SSIM performance comparisons on datasets at multiple CS ratios.
Dataset | CS ratio | ISTA-Net+ | OPINE-Net+ | AMP-Net-BM | DGU-Net+ | TransCS | TCS-Net | TCR-Net
Set5 | 0.01 | 18.5225/0.4408 | 21.8914/0.6101 | 22.4254/0.6185 | 22.4190/0.6237 | -/- | 22.7494/0.6003 | 23.0929/0.6367
Set5 | 0.04 | 23.4528/0.6619 | 27.9457/0.8209 | 27.8246/0.8179 | 28.3861/0.8318 | 27.9142/0.8262 | 27.5483/0.8173 | 28.5561/0.8427
Set5 | 0.10 | 28.6065/0.8315 | 32.5102/0.9058 | 32.1392/0.9031 | 32.8441/0.9111 | 32.2531/0.9138 | 31.4809/0.9067 | 33.1097/0.9244
Set5 | 0.25 | 34.1672/0.9272 | 36.7785/0.9510 | 36.9258/0.9541 | 37.3302/0.9558 | 36.9154/0.9594 | 35.8560/0.9559 | 37.6505/0.9630
Set5 | 0.50 | 39.4886/0.9706 | 41.6234/0.9779 | 42.1352/0.9804 | 42.4728/0.9809 | 42.1788/0.9825 | -/- | 42.7052/0.9842
McM18 | 0.01 | 19.9893/0.4942 | 23.4088/0.6316 | 23.7917/0.6431 | 23.0500/0.6372 | -/- | 23.6266/0.6144 | 24.0858/0.6427
McM18 | 0.04 | 24.2732/0.6577 | 27.9489/0.7891 | 27.9164/0.7887 | 28.1609/0.7998 | 28.0115/0.7943 | 27.5373/0.7907 | 28.3934/0.8126
McM18 | 0.10 | 28.5360/0.8104 | 31.9249/0.8878 | 31.7231/0.8869 | 32.3243/0.8977 | 31.8816/0.8964 | 30.9669/0.8913 | 32.5369/0.9103
McM18 | 0.25 | 33.9880/0.9237 | 36.9213/0.9537 | 37.0400/0.9570 | 37.7359/0.9614 | 37.1519/0.9605 | 35.8945/0.9579 | 37.9561/0.9668
McM18 | 0.50 | 39.5162/0.9728 | 42.2930/0.9834 | 43.0616/0.9866 | 43.6171/0.9875 | 42.9903/0.9872 | -/- | 43.8347/0.9894
BSD100 | 0.01 | 19.1925/0.4056 | 21.8885/0.5103 | 22.2577/0.5225 | 22.1172/0.5126 | -/- | 22.22/0.5029 | 22.7562/0.5248
BSD100 | 0.04 | 22.2471/0.5411 | 25.0002/0.6516 | 25.0882/0.6593 | 25.2757/0.6653 | 25.0518/0.6690 | 24.9076/0.6665 | 25.4058/0.6839
BSD100 | 0.10 | 25.0920/0.6841 | 27.5465/0.7715 | 27.6100/0.7786 | 27.8919/0.7847 | 27.5527/0.7937 | 27.1727/0.7903 | 28.0507/0.8062
BSD100 | 0.25 | 29.0354/0.8402 | 31.2010/0.8870 | 31.4876/0.8974 | 31.6787/0.8984 | 31.3815/0.9039 | 30.6594/0.9009 | 31.9835/0.9121
BSD100 | 0.50 | 33.7145/0.9371 | 36.0208/0.9582 | 36.7344/0.9658 | 36.7421/0.9656 | 36.4358/0.9668 | -/- | 37.1064/0.9704
General100 | 0.01 | 18.9989/0.4700 | 22.5268/0.6229 | 22.9131/0.6321 | 22.8558/0.6276 | -/- | 22.9139/0.6018 | 23.4813/0.6374
General100 | 0.04 | 23.7578/0.6549 | 27.6234/0.7865 | 27.4456/0.7846 | 27.9241/0.7969 | 27.6451/0.7947 | 27.2431/0.7888 | 28.2144/0.8104
General100 | 0.10 | 28.5443/0.8104 | 32.0279/0.8863 | 31.5631/0.8838 | 32.4102/0.8968 | 31.7109/0.8964 | 30.8719/0.8895 | 32.7478/0.9093
General100 | 0.25 | 34.3164/0.9250 | 37.1454/0.9530 | 36.9876/0.3552 | 37.5467/0.9598 | 37.2737/0.9614 | 35.7811/0.9568 | 38.2174/0.9667
General100 | 0.50 | 39.9733/0.9740 | 42.5183/0.9835 | 42.8420/0.9857 | 43.2621/0.9869 | 42.9651/0.9874 | -/- | 43.8010/0.9891
Avg. | 0.01 | 19.1550/0.4424 | 22.2975/0.5728 | 22.6792/0.5835 | 22.5305/0.5767 | -/- | 22.6566/0.5584 | 23.1962/0.5873
Avg. | 0.04 | 23.1151/0.6043 | 26.4806/0.7270 | 26.4350/0.7295 | 26.7659/0.7389 | 26.5178/0.7390 | 26.2264/0.7347 | 26.9770/0.7546
Avg. | 0.10 | 26.9969/0.7542 | 30.0208/0.8354 | 29.8162/0.8373 | 30.3869/0.8469 | 29.8722/0.8507 | 29.2344/0.8455 | 30.6326/0.8635
Avg. | 0.25 | 31.9184/0.8869 | 34.4534/0.9234 | 34.5241/0.6603 | 34.9257/0.9323 | 34.6135/0.9354 | 33.4952/0.9318 | 35.3881/0.9421
Avg. | 0.50 | 37.1189/0.9573 | 39.5664/0.9720 | 40.1050/0.9767 | 40.3492/0.9772 | 40.0215/0.9780 | -/- | 40.7771/0.9806
Table 2. Comparison of parameters, model size, and computational time.
Metric | ISTA-Net+ | OPINE-Net+ | AMP-Net-BM | DGU-Net+ | TransCS | TCS-Net | TCR-Net
Time (s) | 0.0071 | 0.0088 | 0.0383 | 0.0278 | 0.0360 | 0.0212 | 0.0128
Parameters (M) | 0.34 | 0.62 | 0.58 | 6.92 | 1.44 | 0.52 | 0.37
Size (MB) | 1.4 | 2.5 | 2.4 | 9.6 | 21.4 | 2.1 | 4.7
Table 3. Average PSNR (dB) performance comparisons on datasets at a CS ratio of 0.25.
Datasets | TCR-Net-(7, 7, 1) | TCR-Net-(3, 5, 1) | TCR-Net
Set5 | 37.5737 | 37.5016 | 37.6505
McM18 | 37.8523 | 37.7893 | 37.9561
BSD100 | 31.9179 | 31.8828 | 31.9835
General100 | 38.0839 | 38.0262 | 38.2174
Avg. | 35.2887 | 35.2404 | 35.3881
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
