
FusionOpt-Net: A Transformer-Based Compressive Sensing Reconstruction Algorithm

Beijing Electronic Science and Technology Institute, Beijing 100070, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(18), 5976; https://doi.org/10.3390/s24185976
Submission received: 6 August 2024 / Revised: 1 September 2024 / Accepted: 9 September 2024 / Published: 14 September 2024
(This article belongs to the Section Intelligent Sensors)

Abstract

Compressive sensing (CS) is a notable technique in signal processing, especially in multimedia, as it allows for simultaneous signal acquisition and dimensionality reduction. Recent advancements in deep learning (DL) have led to the creation of deep unfolding architectures, which overcome the inefficiency and subpar quality of traditional CS reconstruction methods. In this paper, we introduce a novel CS image reconstruction algorithm that leverages the strengths of the fast iterative shrinkage-thresholding algorithm (FISTA) and modern Transformer networks. To enhance computational efficiency, we employ a block-based sampling approach in the sampling module. By mapping FISTA’s iterative process onto neural networks in the reconstruction module, we address the hyperparameter challenges of traditional algorithms, thereby improving reconstruction efficiency. Moreover, the robust feature extraction capabilities of Transformer networks significantly enhance image reconstruction quality. Experimental results show that the FusionOpt-Net model surpasses other advanced methods on various public benchmark datasets.

1. Introduction

The demand for IoT solutions is currently experiencing significant growth, with the IoT community and the majority of IoT terminal markets showing strong positive sentiment. In terms of device scale, a report by IoT Analytics projects that there will be approximately 27 billion IoT devices by 2025 [1]. Vision is considered the most critical and convenient form of perception, and visual data (such as images and videos) have become the preferred medium for information exchange within the IoT. It is reported that approximately 1.81 trillion photos are taken globally each year, and by 2030, the total number of photos taken is expected to reach 28.6 trillion. Of these, an estimated 6% will be shared and transmitted over the IoT to meet various needs. However, the rapid annual growth in the volume of data that IoT devices need to process has led to a significant challenge. Many IoT devices are resource-constrained, with limited data processing capabilities and strict energy consumption requirements. This necessitates the development of new solutions to efficiently handle data and support the execution of intelligent tasks within the IoT. For example, Lin et al. [2] proposed an image compression and reconstruction algorithm based on compressed sensing that addresses these challenges.
Compressed sensing (CS) [3] is a recently developed signal acquisition, processing, and compression technique. It breaks through the limitations of the traditional Nyquist/Shannon [4,5] sampling theorem. Since its introduction by Candes, Tao, and Donoho in 2006, this theory has shown that it is possible to recover high-dimensional sparse signals from a small number of linear, non-adaptive measurements. This is achievable by solving an optimization problem, even when the measurement count is substantially less than what the Nyquist/Shannon theorem prescribes. Despite reducing the sampling rate, CS still allows for the efficient recovery of signals, making it a promising approach for IoT applications.
To address the optimization problem in CS, several efficient algorithms have been developed, including iterative hard thresholding (IHT) [6], iterative shrinkage-thresholding algorithm (ISTA) [7], fast iterative shrinkage-thresholding algorithm (FISTA) [8], and approximate message passing (AMP) [9].
The original CS problem involves finding the sparsest solution, defined as
$$\min_{x} \|x\|_0 \quad \text{subject to} \quad y = \Phi x.$$
Given noisy measurements y , the CS problem is typically solved as
$$\min_{x} \frac{1}{2}\|\Phi x - y\|_2^2 + \lambda \|x\|_1$$
where $\|\Phi x - y\|_2^2$ represents the data fidelity term, and $\lambda$ is a regularization parameter. For example, ISTA updates the estimate as
$$r^{(k)} = x^{(k)} - \rho\, \Phi^{T}\big(\Phi x^{(k)} - y\big)$$
and applies the thresholding operation:
$$x^{(k+1)} = \arg\min_{x} \frac{1}{2}\|x - r^{(k)}\|_2^2 + \lambda \|x\|_1$$
where $k$ is the iteration step, and $\rho$ controls the convergence speed and accuracy of the thresholding process.
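For concreteness, the two ISTA steps above can be condensed into a short NumPy routine. The following is a minimal sketch, not the paper's implementation; the function names, default step size, and iteration count are illustrative, and the soft threshold is scaled by the step size here, as is common in practice.

```python
import numpy as np

def soft_threshold(v, tau):
    # Closed-form solution of argmin_x 0.5*||x - v||_2^2 + tau*||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(Phi, y, lam=0.1, rho=None, n_iter=200):
    """Plain ISTA sketch for min_x 0.5*||Phi x - y||_2^2 + lam*||x||_1."""
    if rho is None:
        # A safe step size: 1/L, with L the spectral norm of Phi^T Phi
        rho = 1.0 / (np.linalg.norm(Phi, 2) ** 2)
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        r = x - rho * Phi.T @ (Phi @ x - y)   # gradient step on the data fidelity term
        x = soft_threshold(r, rho * lam)      # proximal (soft-thresholding) step on the L1 term
    return x
```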
The primary drawback of traditional reconstruction algorithms lies in their slow convergence speed. Due to the requirement for extensive iterations, significant computational resources are consumed when dealing with large-scale or high-dimensional datasets, making it difficult to meet efficiency demands. Additionally, the performance of traditional reconstruction algorithms highly depends on the selection of preset parameters, such as regularization thresholds and step sizes. These parameters often need experimental tuning, which increases the algorithm’s complexity and usability challenges.
In recent years, with the significant success of emerging deep learning (DL) techniques in computer vision, numerous DL-based models have been proposed for compressive sensing (CS) image reconstruction, such as LISTA [10], ISTA-Net [11], and FISTA-Net [12]. Compared to traditional algorithms, these DL-based CS algorithms leverage extensive training data to learn complex signal features, thereby achieving higher quality reconstructions that better preserve image details and textures. Moreover, deep learning models autonomously learn features from data without the need for manually designed feature extraction methods, addressing the hyperparameter issues of traditional algorithms like FISTA. This capability of automatic feature learning endows deep learning methods with notable advantages in handling complex and high-dimensional data. In fact, in addition to capturing local image features, the global spatial information of images is crucial. However, relying solely on convolutional neural networks (CNNs) to comprehensively learn global information may be limited due to the inherent constraints of stacked convolutional layers, such as effective receptive fields and the issue of redundant filters from over-parameterization. This approach could potentially constrain image reconstruction performance. Addressing this challenge, Shen et al. proposed the TransCS model [13], which introduces a custom ISTA-based Transformer backbone. This model applies iterative gradient descent updates and soft thresholding operations to represent the global spatial relationships among image patches. Nevertheless, the Transformer architecture in TransCS remains computationally complex. To simplify iteration counts and reduce computational resource consumption, we present FusionOpt-Net, a CS model based on Transformer and FISTA algorithms. FusionOpt-Net incorporates a momentum factor and novel sequences to accelerate convergence, while integrating Transformer’s global features to achieve superior image reconstruction performance.
FusionOpt-Net, with its high image reconstruction performance and fast computational speed, is particularly effective in eliminating blocking artifacts and restoring image details even at low sampling rates. This makes it highly suitable for real-time image reconstruction tasks, such as video compression and transmission [14]. Moreover, FusionOpt-Net demonstrates excellent image reconstruction capabilities in noisy environments, maintaining high PSNR and SSIM metrics even in the presence of multiple levels of Gaussian noise. This suggests that the model is applicable in fields requiring high-quality image reconstruction in noisy conditions, such as medical imaging [15] and remote sensing image processing. The primary contributions of this paper are as follows:
  • We propose an innovative framework that integrates FISTA with Transformer networks. Through this integration, we leverage the fast convergence properties of FISTA and the powerful feature extraction capabilities of Transformer networks to significantly enhance the performance of compressive sensing image reconstruction;
  • We conducted experiments on several public datasets to validate that the proposed FusionOpt-Net model outperforms other image-compression-aware reconstruction models significantly in terms of visual representation and quantitative performance metrics.

2. Related Work

2.1. FISTA Algorithm

The fast iterative shrinkage-thresholding algorithm (FISTA) is an accelerated gradient-based method designed to solve sparse linear inverse problems. It builds upon the traditional iterative shrinkage-thresholding algorithm (ISTA) by incorporating momentum acceleration, which significantly enhances convergence speed. Due to its efficiency, FISTA has been widely adopted in fields such as compressed sensing and image reconstruction.
FISTA is formulated to solve optimization problems of the form
$$\min_{x} F(x) = f(x) + g(x).$$
Here, f ( x ) represents a smooth convex function, typically associated with data fidelity, and is expressed as
$$f(x) = \frac{1}{2}\|Ax - b\|_2^2$$
g ( x ) is a non-smooth but convex regularization term, often chosen as the L1 norm:
$$g(x) = \lambda \|x\|_1.$$
Initialization: The algorithm starts with an initial point $x_1 = y_1$ and an initial step size parameter $t_1 = 1$.
Iterative Update: In each iteration, the following update rules are applied:
$$x_{k+1} = \mathrm{prox}_{\gamma g}\big(y_k - \gamma \nabla f(y_k)\big)$$
where $\gamma$ is the step size, typically set to $\gamma = 1/L$, with $L$ being the Lipschitz constant of the gradient of the smooth function $f(x)$. The function $\mathrm{prox}_{\gamma g}(v)$ denotes the proximal operator associated with $g(x)$, defined as
$$\mathrm{prox}_{\gamma g}(v) = \arg\min_{x} \frac{1}{2\gamma}\|x - v\|_2^2 + g(x).$$
The acceleration parameter t k + 1 is then updated as follows:
$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}.$$
Finally, the auxiliary variable y k + 1 is updated using
$$y_{k+1} = x_{k+1} + \frac{t_k - 1}{t_{k+1}}\,\big(x_{k+1} - x_k\big).$$
Termination: The iterative process continues until a predefined convergence criterion is satisfied, such as when the difference $\|x_{k+1} - x_k\|$ falls below a certain threshold.
FISTA’s primary advantage lies in its enhanced convergence rate and ease of implementation. Specifically, compared to traditional gradient descent and ISTA, FISTA achieves a faster convergence rate by utilizing Nesterov’s momentum. This improvement leads to a theoretical convergence rate of $O(1/k^2)$ compared to ISTA’s $O(1/k)$, making it highly effective for large-scale sparse problems. Furthermore, despite the inclusion of momentum, FISTA maintains a computational complexity comparable to ISTA, ensuring both efficient implementation and execution. Moreover, FISTA exhibits great flexibility, as it can be adapted to various regularization terms, such as L1 and L2 norms, making it applicable to a broad range of sparse optimization problems. This adaptability has contributed to FISTA’s widespread use as a reliable tool in areas like compressed sensing and image reconstruction.
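To make the update rules above concrete, a minimal NumPy sketch of FISTA is given below; the function name, default parameters, and stopping tolerance are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fista(A, b, lam=0.1, n_iter=100, tol=1e-6):
    """Minimal FISTA sketch for min_x 0.5*||A x - b||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of f
    gamma = 1.0 / L                        # step size gamma = 1/L
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        # Proximal gradient step: x_{k+1} = prox_{gamma*g}(y_k - gamma * grad f(y_k))
        grad = A.T @ (A @ y - b)
        v = y - gamma * grad
        x_next = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
        # Nesterov momentum update for t and the auxiliary point y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        if np.linalg.norm(x_next - x) < tol:   # termination criterion
            x = x_next
            break
        x, t = x_next, t_next
    return x
```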

2.2. Transformer

The Transformer [16] is a deep learning architecture known for its reliance on the self-attention mechanism, which allows it to capture long-range dependencies in sequential data more effectively than traditional RNNs. Its multi-head attention further enhances the model’s ability to learn diverse patterns by processing multiple attention layers in parallel. Unlike RNNs, the Transformer operates with full parallelism, significantly improving training efficiency. Additionally, positional encoding is used to maintain the order of sequences, while residual connections and layer normalization ensure stable training. These features make the Transformer a highly flexible and powerful model, applicable across various domains including natural language processing and computer vision.
While the Transformer has become the standard for NLP tasks, its application in visual tasks still requires more exploration. An experimental approach to image compressed sensing (CS) is CSformer, which adopts a dual-stream, black-box strategy to merge intermediate features from both Transformer and CNN. In contrast, another work, TransCS, applies global attention to natural images through an iterative process, which can be regarded as an unfolded ISTA recovery framework. This method iteratively conducts gradient descent updates and soft-thresholding, providing well-defined interpretability. Additionally, by integrating Transformer and CNN into a hybrid architecture, TransCS excels at managing the relationships between high-level visual semantic features. Consequently, TransCS capitalizes on the strengths of both Transformer and CNN for image CS, learning global dependencies and local features of image patches, leading to hybrid image reconstruction with high recovery quality. However, the traditional ISTA algorithm used in TransCS, although resolving inherent hyperparameter challenges, suffers from slow convergence and low efficiency. To overcome this limitation, we combine the FISTA algorithm with Transformer, incorporating learnable momentum, which not only accelerates convergence but also preserves high reconstruction accuracy.

2.3. Deep Compressed Sensing

The fundamental idea behind deep compressed sensing (DCS) is to utilize a neural network to learn the complex relationship between measurements and the original signal. This approach enhances both the speed and precision of the reconstruction process, thereby improving the overall performance in image sampling and reconstruction. Typically, DCS aims to minimize the expression $\|x - g_{\Phi}(y)\|_2^2$, where $x$ represents the source signal and $y$ denotes the observation, serving as the network input. The inverse transformation function $g_{\Phi}(\cdot)$, determined by the network’s parameters $\Phi$, is optimized through this process. With the ongoing advancements in deep learning, a growing number of DCS algorithms are being introduced.
These algorithms generally fall into two main categories. The first type integrates traditional CS algorithms with deep learning, employing neural networks for both implementation and computation in an iterative manner. This method maintains the stability and dependability of conventional algorithms while enhancing reconstruction quality and speed through deep learning. For example, ISTA-Net substitutes the sparsity constraints in the linear transform domain of traditional optimization-based spreading algorithms with constraints in the nonlinear transform domain of the network. A similar approach is employed in ADMM-CSNet [17], which builds upon the ADMM algorithm. Although these models utilize a data-driven method for reconstruction, they continue to rely on the traditional, manually designed sensing matrix within the sampling module, potentially limiting reconstruction performance. Additionally, NeumNet [18] was introduced by Gilton et al. as a solution for image inverse problems using the Neumann series. While NeumNet offers high-speed image reconstruction, the resulting images are still significantly impacted by blocking artifacts. AMP-Net incorporates the unfolding algorithm AMP into a neural network structure, extending its capabilities. TransCS, on the other hand, introduces a Transformer-based network built on ISTA that captures global dependencies between image sub-blocks while iteratively applying gradient descent and soft-thresholding operations. Furthermore, DRCAMP-Net [19] integrates AMP with extended residual convolution to mitigate block artifacts and broaden the receptive field.
Another approach focuses on deep learning models built on convolutional neural networks (CNNs). These models reconstruct images by stacking convolutional layers, prioritizing the retention of local image features. For example, DR2-Net [20] leverages linear mapping and residual networks for initial and final image reconstruction, while ReconNet achieves this directly through convolutional layers. DPA-Net [21] enhances reconstruction quality by preserving texture details, and MSCRLNet [22] uses multi-scale residual networks to improve shallow feature extraction by concentrating on channels. However, due to the inherent locality of convolutional layers, CNN-based models have limitations in capturing global positional relationships. To address global dependencies, these models often resort to inefficient stacking of convolutional layers to expand the receptive field. Thus, there is a clear need to establish a new DL-based image CS paradigm that effectively captures global relationships among image subblocks.

3. FusionOpt-Net Module

We propose a novel algorithmic framework that integrates the iterative process of FISTA with the feature extraction process of Transformer networks. This architecture combines the iterative algorithm of FISTA-Net with the deep self-attention mechanism of TransCS, achieving superior image reconstruction performance through technical fusion. The data flow is illustrated in Figure 1.
The core of the FusionOpt-Net model is the FISTA-based Transformer backbone. We customize the traditional FISTA by embedding it into the Transformer architecture. This customization allows the Transformer to effectively model the global dependencies among image subblocks, which are crucial for accurately reconstructing images from compressed measurements. In each iteration, the model performs a gradient descent update followed by a soft thresholding operation, which is a typical step in FISTA. This process is then integrated into the Transformer’s multi-head self-attention mechanism. By doing so, the model not only captures local image features but also effectively models the long-range dependencies across the entire image, which is essential for high-quality reconstruction.
In the model architecture diagram, from “stage 1” to “stage n”, each stage has a clear momentum update mechanism designed to accelerate the convergence of the model. As seen in the diagram, the output of each stage (after processing by the proximal mapping module) is combined with the output from the previous stage, and through a weighted summation (including the momentum term ρ ( k ) ), the input for the next stage is formed. This structure, by adding a momentum term, optimizes the current update step by utilizing a linear combination of the previous two iteration results during each iteration. This not only helps to accelerate convergence but also effectively mitigates oscillations during the iterative process. Our momentum module is designed as a learnable parameter, and during the entire network training process, these momentum parameters dynamically adjust according to the specific task requirements to ensure faster convergence and higher performance.

3.1. Sampling Module

In order to achieve better image reconstruction results, the sampling module of FusionOpt-Net utilizes a data-driven trainable sensing matrix. The sampling module uses a partition function $F_B(\cdot)$ to divide the original image $x$ into $B \times B$ non-overlapping blocks, followed by a flattening function $F_{\mathrm{vec}}(\cdot)$ that projects the blocks into vectors. The sensing matrix $\varphi$, initialized from a Gaussian distribution, is trained through backpropagation using the training images. Therefore, the sampling module can be expressed as
$$y = \mathcal{S}(x, \varphi) = \varphi \cdot F_{\mathrm{vec}}\big(F_B(x)\big)$$
where S ( · , φ ) signifies the sampling process. Compared to random sensing matrices, the learned ones are more efficient for hardware implementation and demand less storage capacity.
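A minimal PyTorch sketch of such a block-based sampling module is shown below; the class name, default block size, and initialization scale are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BlockSampling(nn.Module):
    """Sketch of block-based sampling with a learnable sensing matrix applied to
    flattened, non-overlapping B x B image blocks (names are illustrative)."""
    def __init__(self, block_size=32, sampling_rate=0.1):
        super().__init__()
        n = block_size * block_size                  # length of a flattened block
        m = max(1, int(round(sampling_rate * n)))    # number of measurements per block
        self.block_size = block_size
        # phi is initialized from a Gaussian distribution and trained by backpropagation
        self.phi = nn.Parameter(torch.randn(m, n) / n ** 0.5)

    def forward(self, x):
        # x: (batch, 1, H, W), with H and W divisible by the block size
        B = self.block_size
        blocks = torch.nn.functional.unfold(x, kernel_size=B, stride=B)  # F_vec(F_B(x)): (batch, B*B, num_blocks)
        y = self.phi @ blocks                                            # y = phi * F_vec(F_B(x))
        return y
```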

3.2. Reconstruction Module

The FusionOpt-Net reconstruction module includes two submodules: initial reconstruction and deep reconstruction.

3.2.1. Initial Reconstruction

The initial reconstruction module is a key component of the FusionOpt-Net framework, with its primary task being the initial reconstruction of the image after sampling. This module is implemented through a trainable initial reconstruction matrix $\tilde{\varphi}$. The matrix $\tilde{\varphi}$ is initialized as the transpose of the sampling matrix $\varphi$, i.e., $\tilde{\varphi} = \varphi^{T}$. This initialization leverages the structural information of the sampling matrix, contributing to the stability of the initial reconstruction. The sampled representation $y$ undergoes a linear transformation using the initial reconstruction matrix $\tilde{\varphi}$, yielding the initial reconstructed image $x_{\mathrm{init}}$. This process is expressed as
$$x_{\mathrm{init}} = \mathcal{I}(y, \tilde{\varphi}) = \tilde{\varphi} \cdot y$$
where I ( · , φ ˜ ) represents the initial reconstruction operation. Relying solely on the initial reconstruction module may lead to artifacts and missing details in the initial reconstructed image because the initial reconstruction process only performs a simple linear transformation. To improve reconstruction quality and reduce artifacts, the deep reconstruction module further refines the reconstruction based on this initial output.
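A corresponding sketch of the initial reconstruction step, again with illustrative names and shapes, could look as follows:

```python
import torch
import torch.nn as nn

class InitialReconstruction(nn.Module):
    """Sketch of initial reconstruction with a trainable matrix initialized as phi^T
    (class and attribute names are illustrative)."""
    def __init__(self, phi, block_size=32, image_size=96):
        super().__init__()
        # phi_tilde = phi^T at initialization, then trained freely
        self.phi_tilde = nn.Parameter(phi.detach().t().clone())
        self.block_size = block_size
        self.image_size = image_size

    def forward(self, y):
        # y: (batch, m, num_blocks), the measurements produced by the sampling module
        blocks = self.phi_tilde @ y                    # x_init = phi_tilde * y, still block-vectorized
        x_init = torch.nn.functional.fold(             # reassemble non-overlapping blocks into an image
            blocks, output_size=(self.image_size, self.image_size),
            kernel_size=self.block_size, stride=self.block_size)
        return x_init
```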

3.2.2. Deep Reconstruction

The deep reconstruction module $\mathcal{D}(\cdot)$ is implemented using a FISTA-based Transformer backbone network together with CNNs. The Transformer backbone guides the solution of the general $\ell_1$-norm optimization problem at each layer, where the threshold and shrinkage values are updated in each iteration. The momentum $\rho$ is learned automatically from the training data. $r^k$ represents the residual in FISTA, while the current estimate $x^k$ is obtained from the previous estimate $x^{k-1}$.
Inspired by TransCS, we designed a function $F_B(\cdot)$ that partitions the input into non-overlapping $B \times B$ blocks. The iterative shrinkage-thresholding operation is then expressed as
$$x_o^k = F_B\Big(r^k - \lambda^k \varphi^{T}\big(\varphi \cdot r^k - y\big)\Big)$$
where $r^k$ is the input at the $k$-th iteration, $x_o^k$ denotes the output at the same iteration, and $\lambda^k$ is the step size updated at each iteration according to traditional FISTA.
Next, the pre-processing module (a stack of residual convolutional layers) refines the output $x_o^k$ to reduce noise and preserve high-quality details, with the convolutional layers learned from the training data. This process is expressed as
$$x_{\mathrm{pre}}^k = F_{\mathrm{vec}}^{-1}(x_o^k) - C_{\mathrm{pre}}^k\big(F_{\mathrm{vec}}^{-1}(x_o^k)\big)$$
where $F_{\mathrm{vec}}^{-1}(\cdot)$ denotes the inverse vectorization function, and $C_{\mathrm{pre}}^k(\cdot)$ represents the $k$-th convolutional block. The pre-processing module consists of six layers, each with a $3 \times 3$ kernel size. The first and last layers have one channel, while the middle layers have 32 channels.
In the coding module of the Transformer, FusionOpt-Net deep reconstruction uses the embedded positions of image patches to compress the sequence of input tokens. The positional encoding (PE) is used to retain the spatial relationships between image patches. The final result is a matrix representing the encoded sequence, expressed as
$$x_{\mathrm{en}}^k = T_{\mathrm{en}}^k\Big(F_p(x_{\mathrm{pre}}^k) + PE\big(F_p(x_{\mathrm{pre}}^k)\big)\Big)$$
where $T_{\mathrm{en}}^k$ is the Transformer encoder function, and $F_p(\cdot)$ represents a function that partitions the input into non-overlapping blocks. After encoding, the representation $x_{\mathrm{en}}^k$ undergoes element-wise soft thresholding to reduce noise and improve sparsity. This process is expressed as
$$x_{\mathrm{soft}}^k = F_{\mathrm{sgn}}(x_{\mathrm{en}}^k) \cdot F_{\mathrm{act}}\big(F_{\mathrm{abs}}(x_{\mathrm{en}}^k) - \zeta^k\big)$$
where $F_{\mathrm{sgn}}(\cdot)$ is the sign function, $F_{\mathrm{act}}(\cdot)$ is an activation function, $F_{\mathrm{abs}}(\cdot)$ is the absolute value function, and $\zeta^k$ is the current threshold.
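This soft-thresholding step with a learnable threshold $\zeta^k$ can be written in a few lines of PyTorch; the sketch below assumes ReLU as the activation $F_{\mathrm{act}}$ and uses illustrative names.

```python
import torch
import torch.nn as nn

class LearnableSoftThreshold(nn.Module):
    """Sketch of element-wise soft thresholding with a learnable threshold zeta."""
    def __init__(self, zeta_init=0.01):
        super().__init__()
        self.zeta = nn.Parameter(torch.tensor(zeta_init))

    def forward(self, x_en):
        # x_soft = sign(x_en) * ReLU(|x_en| - zeta)
        return torch.sign(x_en) * torch.relu(torch.abs(x_en) - self.zeta)
```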
The result of the soft thresholding, $x_{\mathrm{soft}}^k$, is combined with the pre-processed result $x_{\mathrm{pre}}^k$, and a weighted update is performed:
$$x_{\mathrm{de}}^k = x_{\mathrm{pre}}^k - \eta^k \cdot T_{\mathrm{de}}^k\big(x_{\mathrm{pre}}^k, x_{\mathrm{soft}}^k\big)$$
where $\eta^k$ is the weight factor, and $T_{\mathrm{de}}^k(\cdot)$ represents the decoder function at the $k$-th iteration.
After the deep module, a post-processing module is applied, expressed as
$$x_{\mathrm{post}}^k = F_B^{-1}(x_{\mathrm{de}}^k) - C_{\mathrm{post}}^k\big(F_B^{-1}(x_{\mathrm{de}}^k)\big)$$
where $F_B^{-1}(\cdot)$ represents the inverse partition function, and $C_{\mathrm{post}}^k(\cdot)$ denotes the convolutional layer configuration in the post-processing block.
The vectorization function $F_{\mathrm{vec}}(\cdot)$ then reprojects the reconstruction result, allowing the image blocks to proceed to the next iteration:
$$x^{k+1} = F_{\mathrm{vec}}(x_{\mathrm{post}}^k)$$
The intermediate variable $r^{k+1}$ is updated for the next iteration using the learnable momentum scalar $\rho^k$ to accelerate convergence:
$$r^{k+1} = x^{k+1} + \rho^k\,\big(x^{k+1} - x^k\big)$$
The final reconstruction result is obtained by applying the inverse vectorization and inverse partition functions to the output of the final iteration:
$$\hat{x} = F_B^{-1}\big(F_{\mathrm{vec}}^{-1}(x^{n+1})\big)$$
The parameters of each iteration stage vary, but the sequence of operations follows a fixed pattern. Consequently, we present Algorithm 1 to summarize the reconstruction process.
Algorithm 1 Forward Propagation for Image Recovery
Require: number of iteration stages $n$, initial reconstruction matrix $\tilde{\varphi}$, soft thresholds $\zeta^{1 \dots n}$, weight coefficients $\eta^{1 \dots n}$, iteration step sizes $\lambda^{1 \dots n}$, measurements $y$, momentum scalars $\rho^{1 \dots n}$, sensing matrix $\varphi$
Ensure: reconstructed image $\hat{x}$
1: Trainable hyperparameters: $\tilde{\varphi}$, $\lambda^{1 \dots n}$, $\zeta^{1 \dots n}$, $\eta^{1 \dots n}$, $\rho^{1 \dots n}$
2: Initialization: $x_{\mathrm{init}} = \mathcal{I}(y, \tilde{\varphi})$
3: Begin the iteration: $k = 1$, $r^1 = x_{\mathrm{init}} = x^1$
4: while $k \le n$ do
5:   $x_o^k = F_B\big(r^k - \lambda^k \varphi^{T}(\varphi \cdot r^k - y)\big)$
6:   $x_{\mathrm{pre}}^k = F_{\mathrm{vec}}^{-1}(x_o^k) - C_{\mathrm{pre}}^k\big(F_{\mathrm{vec}}^{-1}(x_o^k)\big)$
7:   $x_{\mathrm{en}}^k = T_{\mathrm{en}}^k\big(F_p(x_{\mathrm{pre}}^k) + PE(F_p(x_{\mathrm{pre}}^k))\big)$
8:   $x_{\mathrm{soft}}^k = F_{\mathrm{sgn}}(x_{\mathrm{en}}^k) \cdot F_{\mathrm{act}}\big(F_{\mathrm{abs}}(x_{\mathrm{en}}^k) - \zeta^k\big)$
9:   $x_{\mathrm{de}}^k = x_{\mathrm{pre}}^k - \eta^k \cdot T_{\mathrm{de}}^k\big(x_{\mathrm{pre}}^k, x_{\mathrm{soft}}^k\big)$
10:  $x_{\mathrm{post}}^k = F_B^{-1}(x_{\mathrm{de}}^k) - C_{\mathrm{post}}^k\big(F_B^{-1}(x_{\mathrm{de}}^k)\big)$
11:  $x^{k+1} = F_{\mathrm{vec}}(x_{\mathrm{post}}^k)$
12:  $r^{k+1} = x^{k+1} + \rho^k\,(x^{k+1} - x^k)$
13:  $k = k + 1$
14: end while
15: $\hat{x} = F_B^{-1}\big(F_{\mathrm{vec}}^{-1}(x^{n+1})\big)$
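The outer loop of Algorithm 1 reduces to a simple pattern: each stage applies a learned proximal mapping, and its result is combined with the previous estimate through the momentum term. The sketch below illustrates this loop; the per-stage modules are assumed to bundle steps 5–11, and all names are illustrative.

```python
def deep_reconstruction(y, phi, x_init, stages, rho):
    """Sketch of the unfolded loop in Algorithm 1 (names are illustrative).
    `stages` is a list of per-stage modules bundling the gradient step,
    pre-processing CNN, Transformer encoder/decoder with soft thresholding,
    and post-processing CNN (steps 5-11); `rho` holds the learnable momentum scalars."""
    x_curr = x_init
    r = x_init
    for k, stage in enumerate(stages):
        x_next = stage(r, y, phi)                  # steps 5-11: gradient step + learned proximal mapping
        r = x_next + rho[k] * (x_next - x_curr)    # step 12: momentum update for r^{k+1}
        x_curr = x_next
    # The final inverse vectorization/partition (step 15) is applied to x_curr afterwards.
    return x_curr
```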

3.3. Loss Function

During the training of FusionOpt-Net, we simultaneously refine the sampling module $\mathcal{S}(\cdot, \varphi)$ and the recovery module $\mathcal{D}(\mathcal{I}(\cdot, \tilde{\varphi}))$, with the original images serving as both inputs and training labels. The parameters to be trained for the $k$-th stage of the deep reconstruction are denoted by $\omega^k$, while for the $n$ stages, the collective trainable parameters are indicated by $\omega^{1 \dots n}$. To automatically train the initialization and deep reconstruction modules from the measurements $y$, we measure the differences between the source and recovered images using the mean squared error (MSE). We define the loss function as
$$\mathcal{L}_{\mathrm{total}}(\varphi, \tilde{\varphi}, \omega^{1 \dots n}) := \frac{1}{2N} \sum_{i=1}^{N} \Big\| \mathcal{D}\big(\mathcal{I}(\mathcal{S}(x_i, \varphi), \tilde{\varphi})\big) - x_i \Big\|_2^2$$
where $x_i$ represents the $i$-th training image, and $N$ is the total number of training images.
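Up to a constant scaling (per-image versus per-pixel averaging), this objective can be sketched in PyTorch as follows; `model.sample` and `model.reconstruct` are assumed wrappers around the sampling and reconstruction modules, not actual names from the paper.

```python
import torch

def training_loss(model, x_batch):
    """End-to-end MSE training objective: original images are both inputs and labels."""
    y = model.sample(x_batch)          # y = S(x, phi)
    x_hat = model.reconstruct(y)       # x_hat = D(I(y, phi_tilde))
    return 0.5 * torch.mean((x_hat - x_batch) ** 2)
```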

4. Experimental Results

In this section, several experiments are conducted to verify the performance of the proposed method. First, in Section 4.2, the FusionOpt-Net method is compared with other models on several public datasets. Subsequently, in Section 4.3, the robustness of FusionOpt-Net is tested on images corrupted by multiple levels of Gaussian noise.

4.1. Experimental Settings

The FusionOpt-Net training dataset is derived from the BSD500 dataset [23], comprising 200 training images, 100 validation images, and 200 testing images. The validation dataset we use is Set11. We randomly segment images in the training dataset into 200 sub-images, each measuring 96 × 96 pixels, creating a total of 100,000 sub-images. To augment the data, we apply random horizontal and vertical flips, rotations, and scaling to enhance image diversity. The experimental results are evaluated using three widely used benchmarks: Set11 [24], BSD200 [23], and Urban100 [25].
The FusionOpt-Net training process follows the same settings as other DL-based CS methods (such as ISTA-Net). The patch size P is set to 8, the initial step size is 1.0, and the regularization parameter $\lambda$ is initialized to 0.1. The initial value of $\rho$ is 0.01, and the number of iteration stages is set to 8. Training is conducted for 200 epochs with a batch size of 64. The learning rate decays from the 101st to the 150th epoch, and the last 50 epochs are trained with a constant learning rate. We use the Adam optimizer for training. The FusionOpt-Net model is compared with several state-of-the-art methods, including CSformer [26], ISTA-Net+ [11], CSNet [27], AMP-Net [28], and TransCS, all of which combine traditional algorithms with deep learning models. Performance is evaluated using the perceptual metrics PSNR and SSIM; higher PSNR and SSIM values indicate better performance. The models compared with FusionOpt-Net are obtained from their respective sources and executed with default configurations. To ensure an unbiased evaluation, all training images for the rival models are sourced from the BSD500 dataset. The experiments are conducted using the PyTorch 1.9.0 framework on a system with an Intel Xeon 8336 CPU and a GeForce RTX 4090 GPU.
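As a rough sketch of this training configuration, the optimizer and learning-rate schedule could be set up as below; the base learning rate and the exact decay curve between epochs 101 and 150 are assumptions, since only the optimizer, epoch count, and batch size are stated above.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(epoch):
        # Constant for the first 100 epochs, assumed smooth decay over epochs 101-150,
        # then constant for the final 50 epochs
        if epoch < 100:
            return 1.0
        if epoch < 150:
            return 0.5 ** ((epoch - 100) / 50.0)
        return 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```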

4.2. Comparisons with State-of-the-Art Methods

In our study, we conducted a comprehensive evaluation of CSformer, ISTA-Net+, AMP-Net, CSNet, TransCS, and our proposed model across the Set11, BSD200, and Urban100 datasets at sampling rates of 0.04, 0.1, 0.25, and 0.5. The evaluation metrics used were the peak signal-to-noise ratio (PSNR, dB) and the structural similarity index (SSIM). Table 1 presents the detailed experimental results.
The FusionOpt-Net model consistently demonstrated significant advantages across all datasets and sampling rates:
  • Set11 Dataset: At a sampling rate of 0.04, our model achieved a PSNR of 25.34 and SSIM of 0.7815, both superior to other models. At higher rates like 0.5, our model further demonstrated superiority with a PSNR of 39.91 and SSIM of 0.9809, notably higher than ISTA-Net+ (38.07) and TransCS (38.88).
  • BSD200 Dataset: Across various sampling rates, our model consistently outperformed others. For instance, at a rate of 0.25, our model achieved a PSNR of 31.91 and SSIM of 0.9237, surpassing TransCS (PSNR 31, SSIM 0.9171) and ISTA-Net+ (PSNR 29.51, SSIM 0.8659). At a 0.5 sampling rate, our model reached a PSNR of 37.06 and SSIM of 0.9748, reaffirming its superior performance.
  • Urban100 Dataset: At a low sampling rate of 0.04, our model led with a PSNR of 22.05 and SSIM of 0.6619. At a 0.5 sampling rate, our model achieved a PSNR of 35.51 and SSIM of 0.9758, significantly surpassing ISTA-Net+ (PSNR 34.58, SSIM 0.9661) and TransCS (PSNR 34.16, SSIM 0.9687).
On average, our model exhibited the highest PSNR (32.32) and SSIM (0.9), significantly outperforming other models. These results underscore the capability of our model to consistently deliver high-quality reconstructed images across different datasets and sampling rates. Moreover, its robust performance at low sampling rates (e.g., 0.04 and 0.1) highlights its efficacy in sparse data scenarios.
Visual comparisons between our model and competing methods further validate our findings. As shown in Figure 2, our approach excelled in detail preservation, texture reconstruction, and edge sharpness, notably outperforming ISTA-Net+, AMP-Net, and TransCS. Specifically, our model accurately reproduced complex structures, such as natural shadow transitions in portrait images and sharp patterns in butterfly wings, significantly reducing blurring compared to other methods.
In conclusion, our model exhibits superior performance in compressive sensing image reconstruction tasks, as evidenced by both quantitative metrics (PSNR, SSIM) and qualitative visual assessments. These findings underscore the effectiveness and potential application value of our proposed method in the field of image reconstruction.

4.3. Noise Robustness

To assess image reconstruction robustness in various noisy environments, Gaussian noise with a mean of zero and standard deviations $\sigma \in \{0.001, 0.002, 0.004\}$ was added to the BSD200 test dataset. The noise robustness of FusionOpt-Net was compared with that of three deep learning models (ISTA-Net+, AMP-Net, and TransCS). PSNR and SSIM metrics were used for evaluation at four sampling rates $\tau \in \{0.04, 0.1, 0.25, 0.5\}$, along with visual comparisons at each noise level. Additionally, average PSNR and SSIM values for the different noise levels are provided for all four reconstruction methods. The results are shown in Table 2.
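The noise injection and the PSNR metric used in this comparison can be sketched as follows; the sketch assumes images normalized to [0, 1] and applies $\sigma$ directly as the noise standard deviation.

```python
import numpy as np

def add_gaussian_noise(img, sigma):
    """Add zero-mean Gaussian noise to an image assumed to lie in [0, 1]."""
    noisy = img + np.random.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def psnr(x, x_hat, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((x - x_hat) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```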
First, the average peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are compared under the different noise levels. Regardless of the noise level, the FusionOpt-Net model achieves the highest PSNR and SSIM values in most situations. This indicates that the FusionOpt-Net model maintains superior image quality and structural details compared to the other models across varying noise levels. For instance, as the noise level increases from $\sigma = 0.001$ to $\sigma = 0.004$, the average PSNR of the FusionOpt-Net model decreases from 28.26 dB to 25.76 dB, whereas that of ISTA-Net+ drops from 25.90 dB to 23.68 dB. These data indicate that the FusionOpt-Net model maintains a higher PSNR in noisier environments and outperforms the other models in preserving visual structures. The FusionOpt-Net model is able to effectively suppress noise and preserve the high-frequency information of the image structure under various noise conditions.
In particular, under high noise levels and sampling rates, the FusionOpt-Net model still demonstrates excellent visual consistency and structural delineation, as shown in Figure 3. Compared to ISTA-Net+, AMP-Net, and TransCS, the FusionOpt-Net model shows stronger robustness and reliability in various noise environments, achieving superior visual quality and reliable results in practical applications.

4.4. Complexity Analysis

We conduct a model complexity analysis of FusionOpt-Net and several competing methods (ISTA-Net+, CSNet, CSformer, AMP-Net, and TransCS) across three dimensions: average runtime, the number of giga floating-point operations (GFLOPs), and the number of parameters. The average runtime assesses the time required for the model to compress and reconstruct an image. GFLOPs are used to evaluate the computational complexity, while the number of parameters reflects the spatial complexity of the model. These metrics are derived by forward propagating a single 256 × 256 image at a 0.1 sampling rate, as illustrated in Table 3 and Figure 4.
FusionOpt-Net achieves a computational time of 0.026 s on the RTX 4090D GPU for $\tau = 0.1$, making it highly efficient and suitable for real-time applications. Although slightly slower than the fastest model, CSNet (0.008 s), it remains competitive with methods like ISTA-Net+ (0.023 s) and AMP-Net (0.017 s), highlighting a balance between complexity and performance. The parameter count of 1.445 M, comparable to TransCS (1.489 M), reflects FusionOpt-Net’s enhanced feature extraction capabilities, justifying the trade-off for improved reconstruction quality and flexibility. With a moderate computational complexity of 12.011 GFLOPs, FusionOpt-Net is optimized for efficiency without compromising performance, making it a strong candidate for scenarios requiring high precision and resource-conscious deployments.
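For reference, the runtime and parameter count in Table 3 can be measured with a routine along the following lines; the warm-up count, single-channel input, and timing method are illustrative assumptions, and a separate profiler would be needed for the GFLOP figures.

```python
import time
import torch

def complexity_report(model, image_size=256, device="cuda"):
    """Rough sketch of measuring runtime and parameter count for one 256 x 256 image."""
    model = model.to(device).eval()
    x = torch.randn(1, 1, image_size, image_size, device=device)  # assumed single-channel input
    n_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        for _ in range(5):            # warm-up passes before timing
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        model(x)
        torch.cuda.synchronize()
        elapsed = time.time() - start
    return n_params, elapsed
```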

4.5. Ablation Studies

To verify the efficacy of the momentum module, we further conduct ablation studies on BSDS100. The models compared are FusionOpt-Net and FusionOpt-Net without the momentum module. From the results shown in Figure 5, we can observe the following: the momentum module improves the reconstruction quality of an image, and it plays a particularly important role at high sampling rates. This is probably because the module acts as a residual-like structure in the overall architecture, which improves the stability of the deep learning model during image reconstruction, resulting in a recovered result of higher quality that is more similar to the original image.

5. Future Work

The performance of FusionOpt-Net is potentially restricted by the fixed sampling block size used in the Transformer backbone. This fixed size may limit the flexibility and adaptability of the model to different types of images or scenarios where variable block sizes could be more effective. While FusionOpt-Net shows improved robustness to noise compared to other models, there is still room for enhancing its performance in extremely noisy environments. This suggests that the model’s ability to handle varying levels of noise is not fully optimized.
Future research could focus on developing a Transformer-based CS method with a dynamic and adaptive block size. This strategy would allow the model to adjust the block size based on the sampling matrix and the specific characteristics of different image areas, potentially leading to improved reconstruction performance. Another area of future work involves exploring more robust modifications of the Transformer architecture specifically designed for noisy image reconstruction scenarios. Enhancing the model’s ability to maintain high reconstruction quality in the presence of significant noise could further expand its applicability in real-world situations.

6. Conclusions

This paper introduces a novel compressed sensing image reconstruction algorithm that integrates the FISTA and Transformer networks. By combining the fast convergence properties of FISTA with the powerful feature extraction capabilities of Transformer networks, we have developed an efficient and high-quality image reconstruction method. The experimental results demonstrate that the FusionOpt-Net model exhibits significantly superior reconstruction performance across multiple image datasets, outperforming existing models such as ISTA-Net+ and TransCS in terms of metrics like PSNR and SSIM. Particularly noteworthy is its ability to preserve fine details and suppress noise effectively, especially in scenarios with high noise levels and low sampling rates, showcasing robustness in diverse environments.
In comparison to traditional algorithms, the FusionOpt-Net model not only addresses the complexity of hyperparameter tuning but also leverages deep learning to automatically learn features from data, thereby substantially enhancing image reconstruction quality. The future work will focus on further optimizing algorithmic efficiency and exploring its potential in other compressed sensing applications, aiming to achieve efficient and high-quality image reconstruction in broader contexts. This study provides new insights into advancing compressed sensing reconstruction algorithms and establishes a solid foundation for practical image reconstruction tasks.

Author Contributions

Conceptualization, H.Z. and B.C.; methodology, H.Z.; software, H.Z. and L.H.; validation, H.Z., X.G. and X.Y.; formal analysis, H.Z. and B.C.; data curation, H.Z.; writing—original draft preparation, H.Z. and B.C.; writing—review and editing, B.C.; funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

Supported by “the Fundamental Research Funds for the Central Universities” (Grant Number: 3282024057).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

  1. Dharmadhikan, S. IoT Analytics Market Report 2023 (Global Edition); Report; IoT Analytics: Hamburg, Germany, 2023. [Google Scholar]
  2. Lin, Y.; Xie, Z.; Chen, T.; Cheng, X.; Wen, H. Image privacy protection scheme based on high-quality reconstruction DCT compression and nonlinear dynamics. Expert Syst. Appl. 2024, 257, 124891. [Google Scholar] [CrossRef]
  3. Donoho, D. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
  4. Nyquist, H. Certain Topics in Telegraph Transmission Theory. Trans. Am. Inst. Electr. Eng. 1928, 47, 617–644. [Google Scholar] [CrossRef]
  5. Shannon, C. Communication in the Presence of Noise. Proc. IRE 1949, 37, 10–21. [Google Scholar] [CrossRef]
  6. Blumensath, T.; Davies, M.E. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 2009, 27, 265–274. [Google Scholar] [CrossRef]
  7. Daubechies, I.; Defrise, M.; De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 2004, 57, 1413–1457. [Google Scholar] [CrossRef]
  8. Beck, A.; Teboulle, M. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
  9. Donoho, D.L.; Maleki, A.; Montanari, A. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 2009, 106, 18914–18919. [Google Scholar] [CrossRef]
  10. Gregor, K.; LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, Madison, WI, USA, 21–24 June 2010; pp. 399–406. [Google Scholar]
  11. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  12. Xiang, J.; Dong, Y.; Yang, Y. FISTA-Net: Learning a Fast Iterative Shrinkage Thresholding Network for Inverse Problems in Imaging. IEEE Trans. Med. Imaging 2021, 40, 1329–1339. [Google Scholar] [CrossRef]
  13. Shen, M.; Gan, H.; Ning, C.; Hua, Y.; Zhang, T. TransCS: A Transformer-Based Hybrid Architecture for Image Compressed Sensing. IEEE Trans. Image Process. 2022, 31, 6991–7005. [Google Scholar] [CrossRef]
  14. Qiao, M.; Meng, Z.; Ma, J.; Yuan, X. Deep learning for video compressive sensing. APL Photonics 2020, 5, 030801. [Google Scholar] [CrossRef]
  15. Graff, C.G.; Sidky, E.Y. Compressive sensing in medical imaging. Appl. Opt. 2015, 54, C23–C44. [Google Scholar] [CrossRef] [PubMed]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  17. Yang, Y.; Sun, J.; Li, H.; Xu, Z. ADMM-CSNet: A Deep Learning Approach for Image Compressive Sensing. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 521–538. [Google Scholar] [CrossRef]
  18. Gilton, D.; Ongie, G.; Willett, R. Neumann networks for linear inverse problems in imaging. IEEE Trans. Comput. Imaging 2019, 6, 328–343. [Google Scholar] [CrossRef]
  19. Guo, Z.; Zhang, J. Lightweight Dilated Residual Convolution AMP Network for Image Compressed Sensing. In Proceedings of the 2023 4th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 7–9 April 2023; pp. 747–752. [Google Scholar] [CrossRef]
  20. Yao, H.; Dai, F.; Zhang, S.; Zhang, Y.; Tian, Q.; Xu, C. Dr2-net: Deep residual reconstruction network for image compressive sensing. Neurocomputing 2019, 359, 483–493. [Google Scholar] [CrossRef]
  21. Yu, F.; Qian, Y.; Zhang, X.; Gil-Ureta, F.; Jackson, B.; Bennett, E.; Zhang, H. DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly. arXiv 2024, arXiv:2404.00875. [Google Scholar]
  22. Bindels, D.S.; Haarbosch, L.; Van Weeren, L.; Postma, M.; Wiese, K.E.; Mastop, M.; Aumonier, S.; Gotthard, G.; Royant, A.; Hink, M.A.; et al. mScarlet: A bright monomeric red fluorescent protein for cellular imaging. Nat. Methods 2017, 14, 53–56. [Google Scholar] [CrossRef]
  23. Arbeláez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 898–916. [Google Scholar] [CrossRef]
  24. Lohit, S.; Kulkarni, K.; Kerviche, R.; Turaga, P.; Ashok, A. Convolutional Neural Networks for Noniterative Reconstruction of Compressively Sensed Images. IEEE Trans. Comput. Imaging 2018, 4, 326–340. [Google Scholar] [CrossRef]
  25. Huang, J.B.; Singh, A.; Ahuja, N. Single Image Super-Resolution From Transformed Self-Exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  26. Ye, D.; Ni, Z.; Wang, H.; Zhang, J.; Wang, S.; Kwong, S. CSformer: Bridging Convolution and Transformer for Compressive Sensing. IEEE Trans. Image Process. 2023, 32, 2827–2842. [Google Scholar] [CrossRef] [PubMed]
  27. Shi, W.; Jiang, F.; Liu, S.; Zhao, D. Image Compressed Sensing Using Convolutional Neural Network. IEEE Trans. Image Process. 2020, 29, 375–388. [Google Scholar] [CrossRef] [PubMed]
  28. Zhang, Z.; Liu, Y.; Liu, J.; Wen, F.; Zhu, C. AMP-Net: Denoising-Based Deep Unfolding for Compressive Image Sensing. IEEE Trans. Image Process. 2021, 30, 1487–1500. [Google Scholar] [CrossRef] [PubMed]
Figure 1. FusionOpt-Net framework.
Figure 2. Reconstruction results for butterfly and bird images using FusionOpt-Net and other methods. Sampling rates τ are 0.04 for the first row and 0.25 for the second row. Please zoom in for better comparison.
Figure 3. Noise robustness comparison. Visual analysis of different image CS methods on cactus and ship images from the BSD100 dataset at sampling rates $\tau \in \{0.04, 0.10, 0.25\}$. Gaussian noise with variances $\sigma \in \{0.001, 0.002, 0.004\}$ was introduced. Note the effectiveness in recovering the ship images.
Figure 4. Comparison of the number of GFLOPs required to run a 256 × 256 pixel image in the model and the number of model parameters for τ = 0.1.
Figure 5. Comparison of visualizations with and without momentum at different sampling rates $\tau \in \{0.04, 0.25\}$.
Table 1. PSNR (dB) and SSIM assessment for various models on the Set11, BSD200, and Urban100 datasets at different sampling rates $\tau \in \{0.04, 0.1, 0.25, 0.5\}$. Each cell reports PSNR / SSIM; best performances are highlighted in bold.

| Dataset | Model | τ = 0.04 | τ = 0.10 | τ = 0.25 | τ = 0.50 | Avg. |
|---|---|---|---|---|---|---|
| Set11 | CSformer | 23.98 / 0.7388 | 26.57 / 0.8411 | 32.13 / 0.9204 | 38.92 / 0.9876 | 29.50 / 0.8290 |
| Set11 | ISTA-Net+ | 21.56 / 0.6240 | 26.49 / 0.8036 | 32.44 / 0.9237 | 38.07 / 0.9706 | 29.64 / 0.8305 |
| Set11 | AMP-Net | 21.68 / 0.6223 | 25.92 / 0.7828 | 32.05 / 0.9139 | 37.77 / 0.9666 | 29.36 / 0.8214 |
| Set11 | CSNet | 22.34 / 0.6788 | 26.52 / 0.7983 | 32.78 / 0.9178 | 38.63 / 0.9752 | 30.07 / 0.8425 |
| Set11 | TransCS | 23.62 / 0.7253 | 26.64 / 0.8389 | 33.69 / 0.9444 | 38.88 / 0.9766 | 30.71 / 0.8713 |
| Set11 | FusionOpt-Net | **25.34 / 0.7815** | **29.51 / 0.8869** | **34.51 / 0.9508** | **39.91 / 0.9809** | **32.32 / 0.9000** |
| BSD200 | CSformer | 23.78 / 0.6577 | 25.90 / 0.7783 | 29.64 / 0.8947 | 35.21 / 0.9612 | 28.63 / 0.8230 |
| BSD200 | ISTA-Net+ | 22.19 / 0.5682 | 25.21 / 0.7149 | 29.51 / 0.8659 | 34.57 / 0.9509 | 27.87 / 0.7750 |
| BSD200 | AMP-Net | 22.48 / 0.5836 | 25.26 / 0.7108 | 29.58 / 0.8591 | 34.80 / 0.9489 | 28.03 / 0.7756 |
| BSD200 | CSNet | 23.22 / 0.6157 | 25.68 / 0.7457 | 30.02 / 0.9023 | 35.02 / 0.9578 | 28.49 / 0.8054 |
| BSD200 | TransCS | 23.86 / 0.6634 | 26.04 / 0.7804 | 31.00 / 0.9171 | 35.83 / 0.9698 | 29.18 / 0.8327 |
| BSD200 | FusionOpt-Net | **25.10 / 0.7014** | **27.88 / 0.8253** | **31.91 / 0.9237** | **37.06 / 0.9748** | **30.49 / 0.8563** |
| Urban100 | CSformer | 20.22 / 0.5243 | 22.32 / 0.6788 | 28.20 / 0.8842 | 34.02 / 0.9597 | 26.19 / 0.7618 |
| Urban100 | ISTA-Net+ | 18.90 / 0.4913 | 22.60 / 0.6990 | 28.26 / 0.8873 | 34.58 / 0.9661 | 26.09 / 0.7609 |
| Urban100 | AMP-Net | 19.19 / 0.4918 | 22.22 / 0.6680 | 27.68 / 0.8678 | 34.25 / 0.9606 | 25.84 / 0.7471 |
| Urban100 | CSNet | 19.23 / 0.5022 | 22.46 / 0.6981 | 27.91 / 0.8834 | 34.43 / 0.9644 | 26.01 / 0.7620 |
| Urban100 | TransCS | 20.99 / 0.5980 | 23.29 / 0.7483 | 29.26 / 0.9196 | 34.16 / 0.9687 | 26.93 / 0.8087 |
| Urban100 | FusionOpt-Net | **22.05 / 0.6619** | **25.49 / 0.8259** | **30.26 / 0.9308** | **35.51 / 0.9758** | **28.33 / 0.8486** |
Table 2. PSNR (dB) and SSIM comparisons on BSD200 with different noise levels $\sigma$ and various sampling rates $\tau$. Each cell reports PSNR / SSIM; best performances are highlighted in bold.

| σ | τ | ISTA-Net+ | AMP-Net | TransCS | FusionOpt-Net |
|---|---|---|---|---|---|
| 0.001 | 0.04 | 21.63 / 0.4591 | 21.92 / 0.4688 | 23.68 / 0.5560 | **24.36 / 0.5849** |
| 0.001 | 0.10 | 24.13 / 0.5900 | 24.21 / 0.5852 | 25.45 / 0.6667 | **26.55 / 0.7086** |
| 0.001 | 0.25 | 27.24 / 0.7407 | 27.33 / 0.7343 | 29.17 / 0.8175 | **29.38 / 0.8220** |
| 0.001 | 0.50 | 30.63 / 0.8646 | 30.75 / 0.8571 | 32.40 / 0.9005 | **32.74 / 0.9067** |
| 0.001 | Avg. | 25.90 / 0.6636 | 26.05 / 0.6614 | 26.68 / 0.7352 | **28.26 / 0.7556** |
| 0.002 | 0.04 | 21.22 / 0.4045 | 21.52 / 0.4118 | 23.22 / 0.4969 | **23.83 / 0.5244** |
| 0.002 | 0.10 | 23.45 / 0.5295 | 23.55 / 0.5248 | 24.76 / 0.6065 | **25.73 / 0.6486** |
| 0.002 | 0.25 | 26.10 / 0.6827 | 26.22 / 0.6773 | 27.99 / 0.7682 | **28.14 / 0.7728** |
| 0.002 | 0.50 | 29.09 / 0.8262 | 29.24 / 0.8194 | 30.84 / 0.8697 | **31.10 / 0.8764** |
| 0.002 | Avg. | 24.96 / 0.6107 | 25.13 / 0.6082 | 26.71 / 0.6853 | **27.20 / 0.7056** |
| 0.004 | 0.04 | 20.53 / 0.3380 | 20.85 / 0.3428 | 22.46 / 0.4230 | **22.97 / 0.4486** |
| 0.004 | 0.10 | 22.43 / 0.4567 | 22.56 / 0.4522 | 23.69 / 0.5314 | **24.52 / 0.5732** |
| 0.004 | 0.25 | 24.56 / 0.6144 | 24.76 / 0.6120 | 26.39 / 0.7069 | **26.48 / 0.7122** |
| 0.004 | 0.50 | 27.21 / 0.7829 | 27.39 / 0.7774 | 28.86 / 0.8328 | **29.06 / 0.8400** |
| 0.004 | Avg. | 23.68 / 0.5480 | 23.89 / 0.5461 | 25.35 / 0.6235 | **25.76 / 0.6435** |
Table 3. Average running time (in seconds) of different methods on an RTX 4090D GPU with $\tau = 0.1$.

| Method | Running Time (s) |
|---|---|
| ISTA-Net+ | 0.023 |
| CSNet | 0.008 |
| CSformer | 0.046 |
| AMP-Net | 0.017 |
| TransCS | 0.027 |
| FusionOpt-Net | 0.026 |