Article

Lightweight Single Image Super-Resolution via Efficient Mixture of Transformers and Convolutional Networks

College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
*
Author to whom correspondence should be addressed.
Sensors 2024, 24(16), 5098; https://doi.org/10.3390/s24165098
Submission received: 4 July 2024 / Revised: 2 August 2024 / Accepted: 5 August 2024 / Published: 6 August 2024

Abstract

In this paper, we propose a Local Global Union Network (LGUN), which effectively combines the strengths of Transformers and Convolutional Networks to develop a lightweight and high-performance network suitable for Single Image Super-Resolution (SISR). Specifically, we make use of the advantages of Transformers, namely input-adaptive weighting and global context interaction, as well as the advantages of Convolutional Networks, namely spatial inductive biases and local connectivity. In the shallow layers, local spatial information is encoded by Multi-order Local Hierarchical Aggregation (MLHA). In the deeper layers, we utilize Dynamic Global Sparse Attention (DGSA), which is based on the Multi-stage Token Selection (MTS) strategy, to model global context dependencies. Moreover, we conduct extensive experiments on both natural and satellite datasets, acquired through optical and satellite sensors, respectively, demonstrating that LGUN outperforms existing methods.

1. Introduction

Single Image Super-Resolution (SISR) is a prominent research field in computer vision that focuses on enhancing the visual details and overall appearance of low-resolution (LR) images by generating high-resolution (HR) versions. It has diverse applications across domains such as surveillance [1,2,3,4], medical imaging [5,6], satellite imagery [7,8], and monitoring [9,10]. Recent advancements in SISR techniques have leveraged advanced algorithms and deep learning models to effectively recover missing high-frequency details and textures from LR inputs, enabling significant improvements in resolution and visual quality.
Convolutional Networks are widely adopted for various visual tasks, including SISR [11,12]. The inherent properties of convolutional operations, such as the ability to aggregate information from adjacent pixels or regions, e.g., 3 × 3 windows, make them effective at capturing spatially local patterns. These properties, including translation invariance, local connectivity, and the sliding-window strategy, provide valuable inductive biases. However, Convolutional Networks suffer from two main limitations. Firstly, they have a local receptive field, restricting their ability to model global context. Secondly, the interaction between spatial locations is fixed through a static convolutional kernel during inference, limiting their flexibility to adapt to varying input content. Transformers, on the other hand, offer a solution to address these limitations. By introducing self-attention (SA) in Vision Transformers (ViTs), global interactions can be explicitly modeled, and the importance of each token can be dynamically adjusted through attention scores computed between all pairs of tokens during inference. However, the computational complexity of Transformers, which grows quadratically with the token length $N$ (or spatial resolution $H \times W$), poses challenges for real-world applications on resource-constrained hardware. This leads to the following natural question: How can we effectively combine the strengths of Convolutional Networks and ViTs to develop a lightweight and high-performance network suitable for resource-constrained devices?
In this work, we address the aforementioned question by focusing on the design of a lightweight and high-performance network for SISR tasks. Figure 1 compares the performance of our work with that of existing methods. Our proposed approach, named LGUN, leverages the advantages of Convolutional Networks, such as spatial inductive biases and local connectivity, as well as Transformers, which offer input-adaptive weighting and global context interaction. Our core concept is illustrated in Figure 2. Compared to uni-dimensional information communication, e.g., spatial-only communication such as EIMN [13] or channel-only communication such as Restormer [14], our method can achieve local spatial-wise aggregation and global channel-wise interaction simultaneously, both of which are crucial for SISR tasks. As is commonly known, the shallow layers of a Convolutional Network employ convolutional filters with smaller receptive fields, capturing local patterns and features like edges, corners, and textures. These low-level features are extracted in the initial layers, providing local information about the input data. By stacking multiple building blocks, Convolutional Networks gradually enlarge their receptive fields, enabling the capture of large-range spatial context information. Based on this prior knowledge, as shown in Figure 3, we divide the core module, named Local Global Union (LGU), into two stages: Multi-order Local Hierarchical Aggregation (MLHA) and Dynamic Global Sparse Attention (DGSA). In the shallow layers, we employ MLHA to encode local spatial information efficiently. This approach feeds each sub-branch with only a subset of the entire feature, facilitating the explicit learning of distinct feature patterns through the Split–Transform–Fusion (STF) strategy. In the deep layers, we introduce DGSA to model long-range non-local dependencies while obtaining an effective receptive field of $H \times W$. DGSA operates across the feature dimension, utilizing interactions based on the cross-covariance matrix between keys and queries. Considering the potential negative impact of irrelevant or confusing information in the attention matrix, which other methods [14] fail to consider, we incorporate the Multi-stage Token Selection (MTS) strategy into DGSA, which selects multiple top-k similar attention matrices and masks out insignificant elements allocated with lower weights. This reduces redundancy in attention maps and suppresses interference from cluttered backgrounds. The proposed design is robust to changes in the input token length and decreases the computational complexity to $\mathcal{O}(NC^2)$, where $C \ll N$.
Our contributions can be summarized as follows:
(1)
We propose LGUN, a hybridization structure designed for resource-constrained devices. It combines the strengths of Convolutional Networks and ViTs, allowing the proposed LGU to effectively encode both local processing and global interaction throughout the network.
(2)
In the shallow layer, we employ MLHA to focus on encoding local spatial information. By using the STF strategy, MLHA promotes the learning of different patterns while also saving computational resources. In the deep layer, we utilize DGSA based on the MTS strategy to model global context dependencies. This enhances the network’s ability to model complex image patterns with high adaptability and representational power.
(3)
Experimental results on popular benchmark datasets demonstrate the superiority of our method compared to other recently advanced Transformer-based approaches. Our method achieves better results in both quantitative and qualitative evaluations, providing evidence for the effectiveness of the MLHA-with-STF strategy and the DGSA-with-MTS strategy.

2. Related Work

2.1. Convolutional Networks

Classical SISR. Since the introduction of SRCNN [15], Convolutional Networks have emerged as superior solutions for SISR tasks [16]. Over the past decade, numerous novel ideas have been proposed or introduced in this field. These include residual learning [11], densely connected networks [17], neural architecture search (NAS) [18], knowledge distillation [19], channel attention [20], spatial attention [21], non-local attention [22], SA [23], etc. The general trend towards achieving higher performance in SISR is to design deeper and more complex networks. However, these methods often come at the cost of increased computational requirements, making it challenging to deploy them on resource-constrained mobile devices for practical applications.
Efficient SISR. To make Convolutional Networks suitable for computationally limited platforms such as mobile devices, methods such as pruning, NAS, knowledge distillation, reparameterization, and efficient design of convolutional layers have been proposed. Pruning technology involves removing insignificant connections or neurons from a network to reduce its size and complexity, thereby improving generalization ability and computational speed. NAS technology [24], on the other hand, automates the search for the optimal neural structure by exploring various combinations of structures across different platforms with varying computational capabilities. Knowledge distillation technology [19], a method for training smaller models, transfers knowledge from larger, more complex models to enhance performance while reducing computational requirements. Structural reparameterization [25] technology utilizes a multi-branch architecture during training and switches to a plain network during testing to achieve faster inference speed. Efficient convolutional layers, such as depth-wise convolution [26] and convolutional factorization [27], reduce computational resources while maintaining high performance. These design concepts have significantly contributed to the advancement of SISR. However, many existing methods either focus on local spatial information and lack global context understanding, or have high computational complexity that limits their applicability to edge devices. In this work, we propose a hybrid structure called LGUN that combines the strengths of Convolutional Networks (e.g., spatial inductive biases and local connectivity) and Transformers (e.g., input-adaptive weighting and global context processing). Notably, our approach achieves a superior trade-off between complexity and performance (Parameters/Multi-Adds @ PSNR/SSIM: 675K/141G @ 38.24/0.9618).

2.2. Transformers

Pioneer work. Recently, Transformers have attracted significant interest in the computer vision community, thanks to their success in the natural language processing (NLP) field. Several studies have explored the benefits of using a Transformer in vision tasks, e.g., FAT [28] and RISTRA [29]. The seminal work, Vision Transformer (ViT) [30], applies a standard Transformer architecture directly to 2D images for visual recognition and demonstrates promising results. The Image Processing Transformer (IPT) [23] leverages the power of the Transformer to achieve superior performance on various image restoration tasks, such as SR, denoising, and deraining. However, the quadratic computational cost makes it difficult to apply the SA mechanism to the SISR task.
Efficient Transformers. Numerous efforts have been made to reduce complexity while maintaining performance in order to make Transformers more suitable for vision tasks. For instance, Swin Transformer [31] and SwinIR [32] limit the SA calculation to non-overlapping local windows instead of the global scope and introduce a shift operation for cross-window interaction. This approach significantly reduces computational complexity on HR feature maps while capturing local context. Similarly, Shuffle Transformer [33] and HaloNet [34] utilize spatial shuffle and halo operations, respectively, instead of shifted window partitioning. MobileViT [35] employs element-wise operations as replacements for computationally and memory-intensive operations, such as batch-wise matrix multiplication and softmax, to compute context scores. Linformer [36] substitutes self-attention with low-rank approximation operations. Axial self-attention [37] achieves longer-range dependencies in the horizontal and vertical directions by performing SA within each single row or column of the feature map. CSWin [38] proposes a cross-shaped window SA region that includes multiple rows and columns, while Pale Transformer [39] performs SA within a pale-shaped region composed of the same number of interlaced rows and columns of the feature map. Although these methods achieve a trade-off in performance across various vision tasks, the dependencies in the SA layer are limited to local regions to reduce computational complexity, resulting in insufficient context modeling. This limitation restricts the modeling capacity of the entire network. In this study, we propose DGSA, which models long-range non-local dependencies and achieves an effective receptive field of $H \times W$ by operating across the feature dimension. The interactions are based on the cross-covariance matrix between keys and queries. Importantly, the computational complexity is only linear, $\mathcal{O}(NC^2)$, rather than quadratic, $\mathcal{O}(N^2C)$, where $C$ is much smaller than $N$.
Sparse Transformers. In addition, the utilization of global-based attention involves computing attention matrices that consider all image patches (tokens), prompting the question of whether all elements in the sequence need to be attended to. The answer is no. The inherent dense calculation pattern of the SA mechanism amplifies the weights of relatively lower similarities, rendering the feature interaction and aggregation process susceptible to implicit noise. Consequently, redundant or irrelevant representations continue to influence the modeling of global feature dependencies. Numerous studies have demonstrated that the adoption of sparse attention matrices can enhance model performance while reducing memory usage and computational requirements. For instance, Sparse Transformer [40] employs a factorized operation to mitigate complexity and suggests reducing the spatial dimensions of attention’s key and value matrices. Explicit Sparse Transformer [41] improves attention concentration on the global context by explicitly selecting the most relevant segments in NLP tasks. EfficientViT [42] further addresses redundancy in attention maps by explicitly decomposing the computation of each head and feeding them with diverse features. In this study, instead of computing the attention matrix for all query–key pairs as in the conventional SA mechanism, we adopt a selective approach in the proposed DGSA. Specifically, we choose the top-k most similar keys and values for each query. However, the use of predefined k values can be seen as a form of hard coding, potentially impeding the relational learning between pairwise pixels. To mitigate this issue, we generate multiple attention matrices with different degrees of sparsity by employing multiple k values. These matrices are then weighted by adaptively learned coefficients for fusion. Our approach can give higher attention to high-contributing regions while giving stronger suppression to low-contributing regions.

2.3. Combination of Transformers and Convolutional Networks

Several works have incorporated classical design principles of Convolutional Networks into Transformers. These include (1) preserving the locality property [43,44,45,46,47,48] and (2) adopting specific network architectures such as U-Net [14,49,50,51], hierarchical pyramid-like structures [52,53,54], and two-stream architectures [55]. On the other hand, MobileViT [35] and MobileFormer [56] successfully combine MobileNet [57] and ViT [30] to achieve competitive results on mobile devices. HAT [58] introduces a hybridized network with parallel branches for channel attention and multi-head self-attention (MHSA) to reconstruct individual pixels or small regions. ACT [59] utilizes both Transformer and convolution branches and implements a fuse–split strategy to efficiently aggregate local–global information at each stage. In this work, we propose a novel hybridization structure, named LGUN, which leverages the advantages of Convolutional Networks, such as spatial inductive biases and local connectivity, and combines them with Transformers’ input-adaptive weighting and global context processing. By encoding shallow, fine-grained local information and effectively interacting with deep global contextual information, our approach achieves a better complexity–performance trade-off (Parameters/Multi-Adds @ PSNR/SSIM: 542K/113G @ 38.24/0.9618).

3. Methods

3.1. Overall Architecture

The proposed network architecture consists of three primary components: (1) feature extraction $\mathrm{FE}(\cdot)$, (2) nonlinear mapping $\mathrm{NLM}(\cdot)$, and (3) reconstruction $\mathrm{REC}(\cdot)$. The input and output of the model are denoted as $I_{\mathrm{LR}} \in \mathbb{R}^{H \times W \times 3}$ and $I_{\mathrm{SR}} \in \mathbb{R}^{H \times W \times 3}$, respectively. In the initial stage, $I_{\mathrm{LR}}$ undergoes an overlapped image patch embedding process, where a 3 × 3 convolution layer is applied at the beginning of the network. This results in feature maps $F_{\mathrm{embed}} \in \mathbb{R}^{H \times W \times C}$. Subsequently, $F_{\mathrm{embed}}$ passes through $N$ stacked blocks to facilitate the learning of local and global relationships. The final reconstructed result is obtained as $I_{\mathrm{SR}} = \mathrm{REC}(\mathrm{NLM}(F_{\mathrm{embed}}) + F_{\mathrm{embed}})$.
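For clarity, a minimal PyTorch-style sketch of this three-part pipeline is given below. The 3 × 3 embedding convolution and the global residual connection follow the description above, while the pixel-shuffle reconstruction head and the identity placeholders for the stacked blocks are assumptions used only for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class LGUNSkeleton(nn.Module):
    """Minimal sketch of the three-part pipeline: FE -> NLM (+ skip) -> REC."""
    def __init__(self, channels=64, num_blocks=16, scale=4):
        super().__init__()
        # (1) feature extraction: overlapped patch embedding via a 3x3 conv
        self.fe = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # (2) nonlinear mapping: N stacked blocks (identity placeholders here;
        #     in the full model these would be the LGU blocks of Section 3.2)
        self.nlm = nn.Sequential(*[nn.Identity() for _ in range(num_blocks)])
        # (3) reconstruction: conv + pixel shuffle to the target scale (assumed head)
        self.rec = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, i_lr):
        f_embed = self.fe(i_lr)                       # F_embed
        return self.rec(self.nlm(f_embed) + f_embed)  # REC(NLM(F_embed) + F_embed)

# Example: a 64 x 64 LR input yields a 256 x 256 output at the default x4 scale.
sr = LGUNSkeleton()(torch.randn(1, 3, 64, 64))
print(sr.shape)  # torch.Size([1, 3, 256, 256])
```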

3.2. LGU

The core modules of LGU, as depicted in Figure 3, include Multi-order Local Hierarchical Aggregation (MLHA) and Dynamic Global Sparse Attention (DGSA). The MLHA module efficiently encodes local spatial information by feeding each sub-branch with a subset of the entire feature, facilitating the explicit learning of distinct feature patterns. On the other hand, the DGSA module aims to model long-range non-local dependencies by leveraging interactions across feature dimensions, resulting in an effective global receptive field. This design ensures robustness to changes in the input token length while reducing the computational complexity to $\mathcal{O}(NC^2)$, where $C \ll N$. More specific details are provided below:
Shallow layer:
$$X' = X + \mathrm{MLHA}\big(\mathrm{Norm}(X)\big), \qquad X'' = X' + \mathrm{FFN}\big(\mathrm{Norm}(X')\big)$$
Deep layer:
$$Z' = Z + \mathrm{DGSA}\big(\mathrm{Norm}(Z)\big), \qquad Z'' = Z' + \mathrm{FFN}\big(\mathrm{Norm}(Z')\big)$$
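As a minimal sketch, the pre-normalized residual structure described by these equations can be written as follows in PyTorch. The use of GroupNorm(1, C) as the channel-wise Norm and the way the mixer (MLHA or DGSA) and FFN are injected are assumptions for illustration.

```python
import torch.nn as nn

class LGUBlock(nn.Module):
    """Pre-norm residual block used in both stages of LGU.

    `mixer` is MLHA in the shallow layers and DGSA in the deep layers.
    Using GroupNorm(1, C) as the channel-wise "Norm" is an assumption.
    """
    def __init__(self, channels, mixer, ffn):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, channels)
        self.mixer = mixer   # MLHA (shallow) or DGSA (deep)
        self.norm2 = nn.GroupNorm(1, channels)
        self.ffn = ffn       # feed-forward network of Section 3.5

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # X' = X + Mixer(Norm(X))
        x = x + self.ffn(self.norm2(x))    # X'' = X' + FFN(Norm(X'))
        return x
```

Shallow blocks instantiate the mixer with MLHA (Section 3.3), and deep blocks instantiate it with DGSA (Section 3.4).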

3.3. Multi-Order Local Hierarchical Aggregation (MLHA)

In the shallow layer of our method, we employ MLHA to focus on encoding local spatial information. By using the Split–Transform–Fusion (STF) strategy, MLHA promotes the learning of different patterns while also saving computational resources.
Given the input feature $X \in \mathbb{R}^{H \times W \times C}$, it passes through three consecutive units: $\mathrm{Linear} \rightarrow \mathrm{MLHA} \rightarrow \mathrm{Linear}$. The specific details of MLHA are as follows:
Firstly, split. The input feature $F_{\mathrm{in}} \in \mathbb{R}^{H \times W \times C}$ is divided into $s$ subparts denoted by $x_i$. Each subpart has the same spatial size of $H \times W$ and a channel number of $\frac{C}{s}$, where $i \in \{1, 2, \ldots, s\}$.
Secondly, transform. Each subpart feature $x_i$ is individually processed by a large kernel convolutional sequence (LKCS) denoted as $\mathrm{LKCS}_i(\cdot)$, which performs self-adaptive recalibration of the subpart features. Each $\mathrm{LKCS}_i(\cdot)$ has a similar structure: a $k_1 \times k_1$ depth-wise convolution (DW-Conv), a $k_2 \times k_2$ depth-wise dilated convolution (DW-D-Conv), and a $k_3 \times k_3$ convolution (Conv).
Finally, fusion. The MLHA integrates multiple re-weighting $\mathrm{LKCS}_i(\cdot)$ processes, enabling the modeling of spatial pixel relationships and the interaction of multi-order context information for input content self-adaptation. Specifically, each subpart feature $x_i$ ($i > 1$) is added to the output of $\mathrm{LKCS}_{i-1}(\cdot)$ and then passed to the next branch $\mathrm{LKCS}_i(\cdot)$ for further processing. The output feature $y_i$ of $\mathrm{LKCS}_i(\cdot)$ corresponds to the input $x_i$ and is passed to the concatenation layer. The concatenation layer aggregates large-range spatial relationships and multi-order context information, treating them as weight matrices for self-adaptive modulation of the input feature $F_{\mathrm{in}}$. By effectively mining the underlying relevance of $F_{\mathrm{in}}$, positions with high scores receive adequate attention while insignificant positions are suppressed. This flexible and effective modulation of the feature representation promotes the modeling of complex image patterns with high adaptability and representational power. The process can be expressed as follows:
$$F_{\mathrm{MLHA}} = F_{\mathrm{in}} \odot \mathrm{Concat}(y_1, \ldots, y_s)$$
$$y_i = \begin{cases} x_i, & i = 1 \\ \mathrm{LKCS}_i(x_i + y_{i-1}), & 1 < i \le s \end{cases}$$
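The following PyTorch-style sketch illustrates the Split–Transform–Fusion flow and the LKCS structure under stated assumptions: the kernel configurations (3, 5), (5, 7), and (7, 9) follow the ablation in Section 4.3, while the dilation rate of the DW-D-Conv, the element-wise modulation, and placing the two Linear (1 × 1) layers inside the module are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LKCS(nn.Module):
    """Large kernel convolutional sequence: DW-Conv k1, DW-D-Conv k2, 1x1 Conv.

    The dilation rate of the depth-wise dilated conv is not specified in the
    text; dilation=3 here is an assumption.
    """
    def __init__(self, channels, k1, k2, k3=1, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, k1, padding=k1 // 2, groups=channels)
        self.dwd = nn.Conv2d(channels, channels, k2, padding=(k2 // 2) * dilation,
                             dilation=dilation, groups=channels)
        self.pw = nn.Conv2d(channels, channels, k3, padding=k3 // 2)

    def forward(self, x):
        return self.pw(self.dwd(self.dw(x)))


class MLHA(nn.Module):
    """Sketch of Multi-order Local Hierarchical Aggregation with the STF strategy."""
    def __init__(self, channels, kernel_cfgs=((3, 5), (5, 7), (7, 9))):
        super().__init__()
        s = len(kernel_cfgs) + 1                 # s subparts; the first passes through
        assert channels % s == 0
        self.s = s
        sub = channels // s
        self.branches = nn.ModuleList([LKCS(sub, k1, k2) for k1, k2 in kernel_cfgs])
        self.proj_in = nn.Conv2d(channels, channels, 1)    # "Linear" in/out layers
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, f_in):
        x = self.proj_in(f_in)
        parts = torch.chunk(x, self.s, dim=1)    # split into s subparts
        ys = [parts[0]]                          # y_1 = x_1
        for i in range(1, self.s):               # y_i = LKCS_i(x_i + y_{i-1})
            ys.append(self.branches[i - 1](parts[i] + ys[-1]))
        weights = torch.cat(ys, dim=1)           # fusion: multi-order context as weights
        return self.proj_out(x * weights)        # self-adaptive element-wise modulation
```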

3.4. Dynamic Global Sparse Attention (DGSA)

The token-based SA mechanism calculates the weight matrix along the token dimension. However, the quadratic increase in computational complexity as the sequence length N grows makes it unsuitable for long sequences and high-resolution images. To address this, compromise solutions have been proposed with two approaches: (1) replacing global SA with local SA, which restricts the SA calculation to local windows, and (2) reducing the sequence length of the key and the value through pooling or stride convolution. However, the former method can only capture dependencies within a limited local range, thus constraining the modeling capacity of the entire network to a local region. The latter method, on the other hand, may result in excessive downsampling, leading to information loss or the confusion of relationships, which contradicts the purpose of SISR. In this work, we present an efficient solution that enables global interactions in SA with linear complexity. Instead of considering global interactions between all tokens, we propose the use of Dynamic Global Sparse Attention (DGSA), which operates across feature channels rather than tokens. In DGSA, the interactions are based on the cross-covariance matrix computed over the key and query projections of the token features. The specific details are as follows:
Consider an input token sequence $X \in \mathbb{R}^{N \times D}$, where $N$ and $D$ denote the length and dimension of the input sequence, respectively. DGSA first generates the query $Q$, key $K$, and value $V$ from $X$ using linear projection layers,
$$Q = XW_q, \quad K = XW_k, \quad V = XW_v$$
where $W_q$, $W_k$, and $W_v \in \mathbb{R}^{D \times D_h}$ are learnable weight matrices and $D_h$ is the projection dimension. Next, the output of DGSA is computed as a weighted sum over the $N$ value vectors,
$$\mathcal{A}(Q, K, V) = V \cdot \mathrm{Softmax}\!\left(\frac{K^{\top} Q}{\sqrt{d_h}}\right)$$
Importantly, DGSA has a linear complexity of $\mathcal{O}(N)$ rather than the $\mathcal{O}(N^2)$ of vanilla SA.
As mentioned in the Introduction, to address the potential negative impact of irrelevant or confusing information in the SISR task, we introduce a Multi-stage Token Selection (MTS) strategy. As shown in Figure 4, this strategy involves selecting the top-k similar tokens from the keys for each query in order to compute the attention weight matrix. To achieve this, we employ multiple different k values in parallel, resulting in multiple attention matrices with varying degrees of sparsity. The final output is obtained by combining these matrices through a weighted sum. The DGSA with MTS can be expressed as follows:
$$\mathrm{DGSA}(Q, K, V) = \sum_{n=1}^{3} w_n \, \mathrm{DGSA}_{k_n}(Q, K, V)$$
$$\mathrm{DGSA}_{k_n}(Q, K, V) = V \cdot \mathrm{Softmax}\!\left(\mathcal{T}_{k_n}\!\left(\frac{K^{\top} Q}{\sqrt{d_h}}\right)\right)$$
where $w_1$, $w_2$, and $w_3$ represent the assigned weights, which are learned adaptively by the network with an initial value of 0.1, and $\mathcal{T}_{k_n}(\cdot)$ is the dynamic learnable row-wise top-$k$ selection operator:
$$\left[\mathcal{T}_k(A)\right]_{ij} = \begin{cases} A_{ij}, & A_{ij} \in \text{top-}k(\mathrm{row}_j) \\ -\infty, & \text{otherwise} \end{cases}$$
We set the Multi-stage Token Selection thresholds $k_1$, $k_2$, and $k_3$ to $\frac{1}{2}$, $\frac{2}{3}$, and $\frac{3}{4}$, respectively.
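A compact sketch of DGSA with the MTS strategy is given below, operating on a flattened token sequence of shape (B, N, C). The channel-wise (C × C) cross-covariance attention, the three sparsity levels, and the learnable weights initialized to 0.1 follow the description above; interpreting the thresholds as fractions of the channel dimension, the single-head layout, and the softmax scaling factor are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DGSA(nn.Module):
    """Sketch of Dynamic Global Sparse Attention with Multi-stage Token Selection."""
    def __init__(self, dim, ratios=(1 / 2, 2 / 3, 3 / 4)):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.ratios = ratios
        # w_n: learnable fusion weights, each initialized to 0.1
        self.weights = nn.Parameter(torch.full((len(ratios),), 0.1))

    def forward(self, x):                          # x: (B, N, C) token sequence
        c = x.shape[-1]
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # cross-covariance attention logits over channels: (B, C, C)
        attn_logits = (k.transpose(-2, -1) @ q) / (c ** 0.5)
        out = 0
        for w, r in zip(self.weights, self.ratios):
            topk = max(1, int(r * c))
            # keep the top-k entries per row, mask the rest with -inf before softmax
            thresh = attn_logits.topk(topk, dim=-1).values[..., -1:]
            masked = attn_logits.masked_fill(attn_logits < thresh, float('-inf'))
            attn = masked.softmax(dim=-1)          # sparse (B, C, C) attention
            out = out + w * (v @ attn)             # V . Softmax(T_k(K^T Q / sqrt(d_h)))
        return self.proj(out)
```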
In conclusion, DGSA offers two significant advantages. Firstly, it enables the modeling of global correlations by selecting the most similar tokens from the entire attention matrix while effectively filtering out irrelevant ones. Secondly, by employing a weighted sum of multiple attention matrices with varying degrees of sparsity, the model can adequately capture the underlying relevance between all pairs of positions. This approach assigns higher weights to positions of greater importance while suppressing insignificant positions. Consequently, it facilitates the identification of crucial features and their effective utilization in subsequent processing steps. Through this mechanism, our method adaptively selects high-contributing scores from input elements, promoting the modeling of complex image patterns with enhanced adaptability and representational power.

3.5. Feed-Forward Network (FFN)

The original Feed-Forward Network (FFN) has limitations in modeling local patterns and spatial relationships, which are crucial for SISR. The inverted residual block (IRB) incorporates a depth-wise convolution between two linear transform layers. This design enables the aggregation of local information among neighboring pixels within each channel. Building upon this idea, we adopt the IRB’s design paradigm and replace the point-wise convolutional layers in the vanilla FFN with a combination of depth-wise convolutions and squeeze-and-excitation modules. This modification captures local patterns and structures effectively. Further details are provided below.
$$\mathrm{FFN}(X) = \mathrm{Linear}\big(\sigma(\mathrm{SAL}(\mathrm{Linear}(X)))\big)$$
where $\sigma$ denotes the GELU nonlinear activation function and $\mathrm{SAL}$ denotes the spatial awareness layer.
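A possible realization of this modified FFN is sketched below. The 3 × 3 depth-wise convolution followed by a squeeze-and-excitation gate inside the SAL, as well as the expansion ratio of 2, are assumptions consistent with, but not necessarily identical to, the description above.

```python
import torch.nn as nn

class SpatialAwarenessFFN(nn.Module):
    """Sketch of the modified FFN: Linear -> SAL -> GELU -> Linear."""
    def __init__(self, channels, expansion=2, se_reduction=4):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, 1)            # first "Linear" (1x1 conv)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.se = nn.Sequential(                              # squeeze-and-excitation gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden // se_reduction, hidden, 1), nn.Sigmoid(),
        )
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, 1)             # second "Linear"

    def forward(self, x):                                     # x: (B, C, H, W)
        h = self.fc1(x)
        h = self.dw(h)
        h = h * self.se(h)       # SAL: local aggregation + channel-wise gating
        return self.fc2(self.act(h))
```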

3.6. Discussion

As mentioned earlier, our method combines the strengths of Convolutional Networks, such as spatial inductive biases and local connectivity, with Transformers, which provide input-adaptive weighting and global context processing. This integration allows us to achieve a favorable balance between complexity and performance. The advantages of our approach can be summarized as follows:
(1) Fine-grained local modeling. The MLHA incorporates a re-weighting process into both the sub-branch and entire features. By utilizing the extracted convolutional features as weight matrices, we can self-adaptively re-calibrate the input representations, effectively capturing spatial relationships and enabling multi-order feature interactions. This approach ensures that important positions receive appropriate focus while suppressing insignificant positions. It is worth noting that each sub-branch feature $x_i$ can receive features from all preceding subparts $x_j$ ($j \le i$) and passes through large kernel convolutional sequences, resulting in a larger receptive field.
(2) Efficient global interaction. The DGSA is capable of modeling long-range non-local dependencies while obtaining an effective global receptive field. The interactions in DGSA operate across feature dimensions and are based on the cross-covariance matrix between keys and queries. To avoid interference with subsequent super-resolution tasks, our MTS strategy selects multiple top-k similarity scores between queries and keys for attention matrix calculation. This strategy masks out insignificant elements with lower weights, reducing redundancy in attention maps and suppressing cluttered background interference, thereby facilitating better feature aggregation.
(3) Linear complexity. Our method remains robust to changes in the input token length while achieving a linear computational complexity of $\mathcal{O}(NC^2)$, where $C \ll N$. This enables flexible and effective modeling of feature representation, promoting the capture of complex image patterns with high representational power.

4. Experiments

4.1. Implementation Details

Our proposed method comprises 16 fundamental building blocks, with each block having 64 channels. Minor channel adjustments are made only in the image reconstruction part for the ×2, ×3, and ×4 scales. To evaluate the effectiveness of our proposed method, we tested it on five common benchmark datasets: Set5 [60], Set14 [61], BSD100 [62], Urban100 [63], and Manga109 [64]. We measured the average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the luminance (Y) channel of the YCbCr space. Our method was implemented using PyTorch 1.12.0 and trained on a single NVIDIA RTX 3090 GPU. More hyper-parameters of the training process are shown in Table 1.
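For reproducibility, the sketch below shows the usual way PSNR is computed on the Y channel of YCbCr, which is the evaluation protocol stated above. The BT.601 conversion coefficients are standard, while cropping a scale-sized border before evaluation is a common SR convention assumed here rather than taken from the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Convert an RGB image in [0, 1] to the Y channel of YCbCr (BT.601)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b  # Y in [16, 235]

def psnr_y(sr, hr, scale=4):
    """PSNR on the luminance channel; a `scale`-pixel border crop is assumed."""
    sr_y = rgb_to_y(sr)[scale:-scale, scale:-scale]
    hr_y = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Example: two random "images" just to exercise the function.
hr = np.random.rand(128, 128, 3)
sr = np.clip(hr + 0.01 * np.random.randn(128, 128, 3), 0, 1)
print(f"PSNR(Y): {psnr_y(sr, hr):.2f} dB")
```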

4.2. Comparison with State-of-the-Art (SOTA) Methods

To validate the effectiveness of our method, we present the reconstruction results obtained by various SR models on both natural and satellite remote sensing images. These images were captured using common optical sensors (e.g., CMOS) as well as satellite sensors (e.g., millimeter-wave sensors). First, we verify the effectiveness of our proposed method on natural images. In Section 4.2.3, we verify the effectiveness of the method on satellite remote sensing images.

4.2.1. Quantitative and Qualitative Results

In Table 2, we compare the proposed method with recent SOTA efficient SISR approaches for upscale factors of ×2, ×3, and ×4 on five benchmark datasets. The compared methods include SRCNN [15], VDSR [11], DRCN [65], LapSRN [66], MemNet [67], SRFBN-S [68], IDN [69], CARN [70], EDSR [12], FALSR-A [18], SMSR [71], A2N [72], LMAN [26], DRSDN [24], SwinIR [32], and NGswin [73]. Notably, SwinIR [32] and NGswin [73] are recently advanced Transformer-based methods. Specifically, on Set5, the average PSNR value at ×2 scale is improved by 0.63 dB and the average SSIM value at ×2 scale is improved by 0.0036 on average over the other methods; the average PSNR value at ×4 scale is improved by 0.89 dB and the average SSIM value at ×4 scale is improved by 0.0144 on average over the other methods. On Set14, the average PSNR value at ×2 scale is improved by 0.64 dB and the average SSIM value at ×2 scale is improved by 0.0079; the average PSNR value at ×4 scale is improved by 0.64 dB and the average SSIM value at ×4 scale is improved by 0.0165 on average over the other methods. Clearly, with lower complexity, our method (Parameters/Multi-Adds @ PSNR/SSIM: 542K/113G @ 38.24/0.9618) obtains better PSNR/SSIM results than recently advanced Transformer-based and Convolutional Network-based methods, such as SwinIR (878K/243.7G @ 38.14 dB/0.9611) and NGswin (998K/140.4G @ 38.05 dB/0.9610).
In Figure 5, we present the qualitative comparison results for different methods at an upscale factor of ×4. For the images “img 024”, “img 067”, “img 071”, “img 073”, and “img 076” in the Urban100 dataset, our method demonstrates superior reconstruction of lattice and text patterns with minimal blurriness and artifacts compared to other methods. This observation confirms the usefulness and effectiveness of our approach. Taking the image “img 024” as an example, our method accurately generates stripes with the correct direction and minimal blurring, while the other methods produce incorrect stripes and noticeable blur over a wide range.

4.2.2. Visualization Analysis

LAM Results. In Figure 6, we analyze the local attribution map (LAM [76]) results for SwinIR [32], AAN [72], LMAN [26], and our method to investigate the utilization range of pixels in the input image during the reconstruction of the selected area. We employ the diffusion index (DI) as an evaluation metric to assess the model’s ability to extract features and utilize relevant information. As illustrated in Figure 6, our method exhibits the utilization of a larger range of pixel information in reconstructing the area outlined by a red box. This observation demonstrates that our approach achieves a larger receptive field through an efficient local and global interaction.
To facilitate intuitive comparisons, we present a heat map, as shown in Figure 7, illustrating the differences in interest areas between the SR networks (referred to as “Diff”). An observation can be made that the proposed LGUN exhibits a more extensive diffusion region compared to CARN [70], EDSR [12], SwinIR [32], and AAN [72]. This observation indicates that our designs enable the exploitation of a greater amount of intra-frame information while maintaining limited network complexity. This is primarily attributed to the MLHA and DGSA employed in LGUN, which facilitate the learning of diverse information ranges and the selective retention of spatial textures deemed useful.

4.2.3. Remote Sensing Image Super-Resolution

Satellite sensors play a vital role in remote sensing by capturing images and data of the Earth’s surface from space. These sensors are mounted on Earth-orbiting satellites and are specifically designed to gather information across multiple wavelengths of the electromagnetic spectrum. Remote sensing images obtained from satellite sensors offer valuable insights for a wide range of applications, including environmental monitoring, land use classification, disaster management, and climate studies.
One crucial task in remote sensing is SISR, which aims to enhance the resolution of satellite images. Higher-resolution images provide more accurate and detailed information about the Earth’s surface, which is crucial for various applications. Therefore, SISR plays a pivotal role in maximizing the usefulness of remote sensing data. To demonstrate the effectiveness of our proposed method in enhancing remote sensing images obtained from satellite sensors, we present the SISR results of different networks in Figure 8. Our network exhibits clear advantages in recovering remotely sensed images, particularly in capturing texture details, lines, and repetitive structures. In contrast, other comparison methods often introduce artifacts and blending issues when dealing with remote sensing images that have complex backgrounds. At the same time, our network effectively mitigates blurring artifacts and reconstructs edge details with higher fidelity.

4.3. Ablation Study

In Table 3, we present the results of the ablation study for our method. Below, we discuss the ablation results based on the following aspects:
The influence of the structure configuration. The primary objective of this study was to efficiently encode local spatial information, model long-range non-local dependencies, and achieve a global receptive field by leveraging the strengths of Convolutional Networks, which provide spatial inductive biases and local connectivity, and Transformers, which offer input-adaptive weighting and global context interaction. In order to validate the effectiveness of the two core modules, namely MLHA and DGSA, we conducted experiments where one module was removed while the other was retained. The results, as presented in Table 3(a), demonstrate a significant decrease in model performance when either of the modules is removed. These findings indicate that the model benefits from both the global interaction introduced by the DGSA module and the fine-grained local modeling achieved by MLHA.
The influence of the MLHA part. In the initial layers of our model, we utilize MLHA to efficiently encode local spatial information. This is achieved by feeding each sub-branch with a specific subset of the complete feature. The effectiveness of the STF strategy is demonstrated in Table 3(b), where it is shown to enhance the explicit learning of distinct feature patterns within the network, leading to improved performance compared to models trained without the STF strategy.
The influence of the DGSA part. In the deeper layers of our model, we introduce DGSA to effectively model long-range non-local dependencies and achieve a global receptive field of H × W. To reduce redundancy in attention maps and mitigate interference from cluttered backgrounds, we employ the MTS strategy, which selects multiple top-k similar attention matrices and masks out elements with lower weights. In Table 3(c), we display the results of a series of experiments to assess the effectiveness of the DGSA module. These experiments include scenarios with no sparse attention (w/o top-k), sparse attention (w/top-k), and sparse attention with the MTS strategy (top-k with MTS). The results of these experiments indicate that employing sparse attention with the MTS strategy leads to improved performance.
The influence of the design of LKCS in the MLHA part. We conducted an experiment to verify the effectiveness of the three LKCS modules in our MLHA. Each LKCS module consists of three convolution layers: a DW-Conv layer, a DW-D-Conv layer, and a Conv layer. The three LKCS modules differ in the kernel sizes of these layers: in the first LKCS module, the kernel sizes are 3, 5, and 1; in the second, they are 5, 7, and 1; and in the third, they are 7, 9, and 1. To verify the benefit of extracting features with different kernel sizes, we also ran a control experiment in which the three LKCS modules were identical, with the kernel sizes of all three convolution layers set to 5, 7, and 1. The results, shown in Table 3(d), demonstrate the effectiveness of our proposed LKCS design.

4.4. Application

There are many potential applications of the lightweight image super-resolution approach. For example, in surveillance, SR techniques can enhance video resolution, making images sharper and clearer so that details, such as facial features and license plate numbers, can be more easily identified, thus enhancing security. In medical imaging, SR technology can improve the clarity of medical images and help doctors diagnose conditions more accurately. In the field of satellite imagery, SR technology can improve image quality and make remote sensing data analysis more accurate, with applications in environmental monitoring, urban planning, and other fields. The lightweight SR method is particularly suitable for resource-constrained devices and real-time processing scenarios due to its low computation and storage requirements.

5. Conclusions

The aim of this study is to develop a lightweight and high-performance network for SISR by effectively combining the strengths of Transformers and Convolutional Networks. To achieve this objective, we propose a novel lightweight SISR method called LGUN. LGUN focuses on encoding local spatial information within MLHA and utilizes the Split–Transform–Fusion (STF) strategy to facilitate the learning of diverse patterns. Additionally, it models global context dependencies through the core module: DGSA. DGSA selects multiple top-k similar attention matrices and masks out elements with lower weights, thereby reducing redundancy in attention maps and suppressing interference from cluttered backgrounds. The experimental results, evaluated on popular benchmarks, demonstrate the superior quantitative and qualitative performance of our method.

Author Contributions

Conceptualization, L.X.; methodology, L.X.; software, L.X.; validation, X.L.; formal analysis, X.L.; investigation, L.X. and X.L.; resources, C.R.; data curation, L.X.; writing—original draft preparation, L.X.; writing—review and editing, C.R.; visualization, L.X.; supervision, C.R.; project administration, C.R.; funding acquisition, C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62171304, the Natural Science Foundation of Sichuan Province under Grant 2024NSFSC1423, and the Cooperation Science and Technology Project of Sichuan University and Dazhou City under Grant 2022CDDZ-09.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The public data used in this work are listed here: Flickr2K [12], Set5 [60], Set14 [61], Urban100 [63], BSDS100 [62], Manga109 [64] and DIV2K [77].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Silva, N.P.; Amin, B.; Dunne, E.; Hynes, N.; O’Halloran, M.; Elahi, A. Implantable Pressure-Sensing Devices for Monitoring Abdominal Aortic Aneurysms in Post-Endovascular Aneurysm Repair. Sensors 2024, 24, 3526. [Google Scholar] [CrossRef]
  2. Silva, N.P.; Elahi, A.; Dunne, E.; O’Halloran, M.; Amin, B. Design and Characterisation of a Read-Out System for Wireless Monitoring of a Novel Implantable Sensor for Abdominal Aortic Aneurysm Monitoring. Sensors 2024, 24, 3195. [Google Scholar] [CrossRef] [PubMed]
  3. Negre, P.; Alonso, R.S.; González-Briones, A.; Prieto, J.; Rodríguez-González, S. Literature Review of Deep-Learning-Based Detection of Violence in Video. Sensors 2024, 24, 4016. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, H.; Yang, L.; Zhang, L.; Shang, F.; Liu, Y.; Wang, L. Accelerated Stochastic Variance Reduction Gradient Algorithms for Robust Subspace Clustering. Sensors 2024, 24, 3659. [Google Scholar] [CrossRef] [PubMed]
  5. Chakraborty, D.; Boni, R.; Mills, B.N.; Cheng, J.; Komissarov, I.; Gerber, S.A.; Sobolewski, R. High-Density Polyethylene Custom Focusing Lenses for High-Resolution Transient Terahertz Biomedical Imaging Sensors. Sensors 2024, 24, 2066. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, W.; He, J.; Liu, H.; Yuan, W. MDC-RHT: Multi-Modal Medical Image Fusion via Multi-Dimensional Dynamic Convolution and Residual Hybrid Transformer. Sensors 2024, 24, 4056. [Google Scholar] [CrossRef] [PubMed]
  7. Chang, H.K.; Chen, W.W.; Jhang, J.S.; Liou, J.C. Siamese Unet Network for Waterline Detection and Barrier Shape Change Analysis from Long-Term and Large Numbers of Satellite Imagery. Sensors 2023, 23, 9337. [Google Scholar] [CrossRef] [PubMed]
  8. Njimi, H.; Chehata, N.; Revers, F. Fusion of Dense Airborne LiDAR and Multispectral Sentinel-2 and Pleiades Satellite Imagery for Mapping Riparian Forest Species Biodiversity at Tree Level. Sensors 2024, 24, 1753. [Google Scholar] [CrossRef] [PubMed]
  9. Wan, S.; Guan, S.; Tang, Y. Advancing bridge structural health monitoring: Insights into knowledge-driven and data-driven approaches. J. Data Sci. Intell. Syst. 2023, 2, 129–140. [Google Scholar] [CrossRef]
  10. Wu, Z.; Tang, Y.; Hong, B.; Liang, B.; Liu, Y. Enhanced Precision in Dam Crack Width Measurement: Leveraging Advanced Lightweight Network Identification for Pixel-Level Accuracy. Int. J. Intell. Syst. 2023, 2023, 9940881. [Google Scholar] [CrossRef]
  11. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  13. Liu, X.; Liao, X.; Shi, X.; Qing, L.; Ren, C. Efficient Information Modulation Network for Image Super-Resolution. In ECAI 2023; IOS Press: Amsterdam, The Netherlands, 2023; pp. 1544–1551. [Google Scholar]
  14. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  15. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  16. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  17. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  18. Chu, X.; Zhang, B.; Ma, H.; Xu, R.; Li, Q. Fast, accurate and lightweight super-resolution with neural architecture search. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 59–64. [Google Scholar]
  19. Gao, Q.; Zhao, Y.; Li, G.; Tong, T. Image super-resolution using knowledge distillation. In Proceedings of the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part II. Springer: Berlin/Heidelberg, Germany, 2019; pp. 527–541. [Google Scholar]
  20. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  22. Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. arXiv 2019, arXiv:1903.10082. [Google Scholar]
  23. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12299–12310. [Google Scholar]
  24. Cheng, G.; Matsune, A.; Du, H.; Liu, X.; Zhan, S. Exploring more diverse network architectures for single image super-resolution. Knowl. Based Syst. 2022, 235, 107648. [Google Scholar] [CrossRef]
  25. Wang, X.; Dong, C.; Shan, Y. Repsr: Training efficient vgg-style super-resolution networks with structural re-parameterization and batch normalization. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 2556–2564. [Google Scholar]
  26. Wan, J.; Yin, H.; Liu, Z.; Chong, A.; Liu, Y. Lightweight image super-resolution by multi-scale aggregation. IEEE Trans. Broadcast. 2020, 67, 372–382. [Google Scholar] [CrossRef]
  27. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  28. Fan, Q.; Huang, H.; Zhou, X.; He, R. Lightweight vision transformer with bidirectional interaction. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  29. Zhou, X.; Huang, H.; Wang, Z.; He, R. Ristra: Recursive image super-resolution transformer with relativistic assessment. IEEE Trans. Multimed. 2024, 26, 6475–6487. [Google Scholar] [CrossRef]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  32. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  33. Huang, Z.; Ben, Y.; Luo, G.; Cheng, P.; Yu, G.; Fu, B. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv 2021, arXiv:2106.03650. [Google Scholar]
  34. Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; Shlens, J. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12894–12904. [Google Scholar]
  35. Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar]
  36. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  37. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
  38. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  39. Wu, S.; Wu, T.; Tan, H.; Guo, G. Pale transformer: A general vision transformer backbone with pale-shaped attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2731–2739. [Google Scholar]
  40. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  41. Zhao, G.; Lin, J.; Zhang, Z.; Ren, X.; Su, Q.; Sun, X. Explicit sparse transformer: Concentrated attention through explicit selection. arXiv 2019, arXiv:1912.11637. [Google Scholar]
  42. Cai, H.; Gan, C.; Han, S. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv 2022, arXiv:2205.14756. [Google Scholar]
  43. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 579–588. [Google Scholar]
  44. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185. [Google Scholar]
  45. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 22–31. [Google Scholar]
  46. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. Levit: A vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 12259–12269. [Google Scholar]
  47. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.05707. [Google Scholar]
  48. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R. Early convolutions help transformers see better. Adv. Neural Inf. Process. Syst. 2021, 34, 30392–30400. [Google Scholar]
  49. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  50. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part III. Springer: Berlin/Heidelberg, Germany, 2023; pp. 205–218. [Google Scholar]
  51. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. arXiv 2022, arXiv:2204.03883. [Google Scholar] [CrossRef]
  52. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 568–578. [Google Scholar]
  53. Pan, Z.; Zhuang, B.; Liu, J.; He, H.; Cai, J. Scalable vision transformers with hierarchical pooling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 377–386. [Google Scholar]
  54. Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11936–11945. [Google Scholar]
  55. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 357–366. [Google Scholar]
  56. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5270–5279. [Google Scholar]
  57. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  58. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar]
  59. Yoo, J.; Kim, T.; Lee, S.; Kim, S.H.; Lee, H.; Kim, T.H. Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution. arXiv 2022, arXiv:2203.07682. [Google Scholar]
  60. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012. [Google Scholar]
  61. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the Curves and Surfaces: 7th International Conference, Avignon, France, 24–30 June 2010; Revised Selected Papers 7. Springer: Berlin/Heidelberg, Germany, 2012; pp. 711–730. [Google Scholar]
  62. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 416–423. [Google Scholar]
  63. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  64. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  65. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  66. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
  67. Tai, Y.; Yang, J.; Liu, X.; Xu, C. Memnet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4539–4547. [Google Scholar]
  68. Li, Z.; Yang, J.; Liu, Z.; Yang, X.; Jeon, G.; Wu, W. Feedback network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3867–3876. [Google Scholar]
  69. Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731. [Google Scholar]
  70. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar]
  71. Wang, L.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W.; Guo, Y. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4917–4926. [Google Scholar]
  72. Chen, H.; Gu, J.; Zhang, Z. Attention in attention network for image super-resolution. arXiv 2021, arXiv:2104.09497. [Google Scholar]
  73. Choi, H.; Lee, J.; Yang, J. N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution. arXiv 2022, arXiv:2211.11436. [Google Scholar]
  74. Liu, C.; Lei, P. An efficient group skip-connecting network for image super-resolution. Knowl. Based Syst. 2021, 222, 107017. [Google Scholar] [CrossRef]
  75. Esmaeilzehi, A.; Ahmad, M.O.; Swamy, M. FPNet: A Deep Light-Weight Interpretable Neural Network Using Forward Prediction Filtering for Efficient Single Image Super Resolution. IEEE Trans. Circuits Syst. II Express Briefs 2021, 69, 1937–1941. [Google Scholar] [CrossRef]
  76. Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 9199–9208. [Google Scholar]
  77. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
Figure 1. Trade-off between performance and model complexity on the Set5 ×4 dataset. Multi-Adds are calculated on 1280 × 720 HR images.
Figure 2. Compared to uni-dimensional information communication, e.g., spatial-only or channel-only, our method can achieve local spatial-wise aggregation and global channel-wise interaction simultaneously, both of which are crucial for SISR tasks.
Figure 3. The architecture of our proposed method, LGUN, consists of three main parts: feature extraction, nonlinear mapping, and image reconstruction. The core modules, named LGU, comprise two stages: MLHA and DGSA. In the shallow layers, MLHA efficiently encodes local spatial information by operating on subsets of the entire feature, enabling explicit learning of distinct feature patterns through the STF strategy. In the deep layers, DGSA models long-range non-local dependencies while achieving a global effective receptive field. DGSA operates across the feature dimension and leverages interactions based on the cross-covariance matrix between keys and queries. Moreover, we incorporate the MTS strategy into DGSA, which selects multiple top-k similar attention matrices and masks out elements with lower weights. This reduces redundancy in the attention maps and suppresses interference from cluttered backgrounds. LGUN is robust to changes in the input token length and significantly reduces the computational complexity to O(NC²), where C ≪ N.
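As a complement to the caption above, the following minimal PyTorch-style sketch illustrates the three-part layout (feature extraction, nonlinear mapping with stacked LGU blocks, and image reconstruction). The block contents, channel width, block count, and pixel-shuffle upsampler are illustrative assumptions and do not reproduce the released implementation; MLHA and DGSA are stood in for by simple placeholders.

```python
import torch
import torch.nn as nn

class LGUBlock(nn.Module):
    """Placeholder LGU block: a local (MLHA-like) stage followed by a global (DGSA-like) stage.
    Both sub-modules are stand-ins; a sketch of the DGSA/MTS attention follows Figure 4."""
    def __init__(self, channels):
        super().__init__()
        self.local_stage = nn.Sequential(                     # stands in for MLHA
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, 1),
        )
        self.global_stage = nn.Conv2d(channels, channels, 1)  # stands in for DGSA

    def forward(self, x):
        x = x + self.local_stage(x)
        x = x + self.global_stage(x)
        return x

class LGUNSketch(nn.Module):
    def __init__(self, channels=48, num_blocks=8, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)       # feature extraction
        self.body = nn.Sequential(*[LGUBlock(channels) for _ in range(num_blocks)])  # nonlinear mapping
        self.tail = nn.Sequential(                             # image reconstruction
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        feat = self.head(lr)
        feat = feat + self.body(feat)   # long skip connection, a common SISR design choice
        return self.tail(feat)

sr = LGUNSketch()(torch.randn(1, 3, 64, 64))   # -> torch.Size([1, 3, 256, 256])
```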
Figure 4. Multiple attention matrices. Taking one head as an example (D = D_h), w1, w2, w3, and w4 denote the assigned weights, which are obtained through dynamic adaptive learning by the network. We set the Multi-stage Token Selection thresholds k1, k2, k3, and k4 to 1/2, 2/3, 3/4, and 4/5, respectively.
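To make the Multi-stage Token Selection idea concrete, the sketch below forms a channel-wise (cross-covariance) attention matrix, derives several masked copies that keep only the top ⌈k·C⌉ entries per row for k ∈ {1/2, 2/3, 3/4, 4/5}, and fuses them with learnable weights w1–w4. This is one possible reading of the described strategy, not the released code; the tensor layout, softmax placement, and normalization are assumptions.

```python
import torch
import torch.nn as nn

class MultiStageTopkChannelAttention(nn.Module):
    """Channel-wise attention with multi-stage top-k masking (a sketch of DGSA with MTS)."""
    def __init__(self, channels, ratios=(1/2, 2/3, 3/4, 4/5)):
        super().__init__()
        self.qkv = nn.Linear(channels, channels * 3, bias=False)
        self.proj = nn.Linear(channels, channels)
        self.ratios = ratios
        self.stage_weights = nn.Parameter(torch.ones(len(ratios)))  # w1..w4, learned

    def forward(self, x):                        # x: (B, N, C) tokens
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Cross-covariance attention: a C x C matrix instead of N x N.
        attn = (q.transpose(1, 2) @ k) / N       # (B, C, C)

        fused = torch.zeros_like(attn)
        w = torch.softmax(self.stage_weights, dim=0)
        for wi, r in zip(w, self.ratios):
            keep = max(1, int(r * C))
            thresh = attn.topk(keep, dim=-1).values[..., -1:]        # per-row k-th largest score
            masked = attn.masked_fill(attn < thresh, float("-inf"))  # drop low-weight elements
            fused = fused + wi * torch.softmax(masked, dim=-1)

        out = (fused @ v.transpose(1, 2)).transpose(1, 2)            # back to (B, N, C)
        return self.proj(out)

tokens = torch.randn(2, 64 * 64, 32)
print(MultiStageTopkChannelAttention(32)(tokens).shape)   # torch.Size([2, 4096, 32])
```

Because the attention matrix is C × C rather than N × N, the cost stays on the order of O(NC²) regardless of the input token length, in line with the complexity stated in the Figure 3 caption.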
Figure 5. Qualitative comparison of state-of-the-art methods on Urban100 [63]. Our method achieves better performance with fewer artifacts and less blur.
Figure 6. Results of local attribution maps. A more widely distributed red area and higher DI represent a larger range of pixel utilization.
Figure 7. The heat maps show the areas of interest for different SR networks. The red regions are those noticed by CARN [70], EDSR [12], SwinIR [32], and AAN [72], while the blue areas represent the additional LAM interest areas of the proposed LGUN (which achieves a higher diffusion index).
Figure 8. Qualitative comparison of state-of-the-art methods on the AID dataset.
Table 1. Hyper-parameters of the training process.
Training Config | Settings
Random rotation | 90°, 180°, 270°
Random flipping | Horizontal
Patch size | 64 × 64
Batch size | 16
Base learning rate | 5 × 10⁻⁴
Optimizer momentum | β₁ = 0.9, β₂ = 0.999
Weight decay | 1 × 10⁻⁴
Learning rate schedule | Cosine decay
Learning rate bound | 1 × 10⁻⁷
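The settings in Table 1 map directly onto a standard optimizer and scheduler configuration; the sketch below shows one way they could be wired up in PyTorch. The model, loss function, and total iteration count are placeholders, since they are not specified in Table 1, and whether the 64 × 64 patches are LR or HR crops is likewise not restated here.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder; substitute the actual SR network
total_iters = 500_000                          # assumed; not specified in Table 1

# Table 1: base LR 5e-4, Adam betas (0.9, 0.999), weight decay 1e-4,
# cosine decay with a learning-rate bound of 1e-7.
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=1e-7)

l1_loss = torch.nn.L1Loss()   # a common SISR choice; the paper's loss is not restated in this section

def train_step(lr_patch, hr_patch):
    """One step on a batch of 16 patches (Table 1), with random rotation and
    horizontal flipping assumed to be applied by the data pipeline beforehand."""
    optimizer.zero_grad()
    loss = l1_loss(model(lr_patch), hr_patch)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```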
Table 2. Quantitative comparison with state-of-the-art methods on five popular benchmark datasets. ‘Multi-Adds’ is calculated on a 1280 × 720 HR image. Bold indicates the best value in each group.
Method | Scale | #Params (K) | Multi-Adds (G) | Set5 | Set14 | BSDS100 | Urban100 | Manga109
(All benchmark columns report PSNR/SSIM.)
Bicubic | ×2 | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 | 30.80/0.9339
SRCNN (TPAMI’14) [15] | ×2 | 57 | 52.7 | 36.66/0.9542 | 32.45/0.9067 | 31.36/0.8879 | 29.50/0.8946 | 35.60/0.9663
VDSR (CVPR’16) [11] | ×2 | 665 | 612.6 | 37.53/0.9590 | 33.05/0.9130 | 31.90/0.8960 | 30.77/0.9140 | 37.22/0.9750
DRCN (CVPR’16) [65] | ×2 | 1774 | 9788.7 | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 | 37.55/0.9732
LapSRN (CVPR’17) [66] | ×2 | 813 | 29.9 | 37.52/0.9591 | 33.08/0.9130 | 31.08/0.8950 | 30.41/0.9101 | 37.27/0.9740
MemNet (ICCV’17) [67] | ×2 | 677 | 623.9 | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195 | 37.72/0.9740
IDN (CVPR’18) [69] | ×2 | 553 | 127.7 | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 | 38.01/0.9749
CARN (ECCV’18) [70] | ×2 | 1592 | 222.8 | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | 38.36/0.9765
EDSR-baseline (CVPR’19) [12] | ×2 | 1370 | 316 | 37.99/0.9604 | 33.57/0.9175 | 32.16/0.8994 | 31.98/0.9272 | 38.54/0.9769
SRFBN-S (CVPR’19) [68] | ×2 | 282 | 574.4 | 37.78/0.9597 | 33.35/0.9156 | 32.00/0.8970 | 31.41/0.9207 | 38.06/0.9757
FALSR-A (ICPR’21) [18] | ×2 | 1021 | 234.7 | 37.82/0.9595 | 33.55/0.9168 | 32.12/0.8987 | 31.93/0.9256 | -
SMSR (CVPR’21) [71] | ×2 | 985 | 131.6 | 38.00/0.9601 | 33.64/0.9179 | 32.17/0.8990 | 32.19/0.9284 | 38.76/0.9771
A2N (arXiv’21) [72] | ×2 | 1036 | 247.5 | 38.06/0.9608 | 33.75/0.9194 | 32.22/0.9002 | 32.43/0.9311 | 38.87/0.9769
LMAN (TBC’21) [26] | ×2 | 1531 | 347.1 | 38.08/0.9608 | 33.80/0.9023 | 32.22/0.9001 | 32.42/0.9302 | 38.92/0.9772
SwinIR (ICCV’21) [32] | ×2 | 878 | 243.7 | 38.14/0.9611 | 33.86/0.9206 | 32.31/0.9012 | 32.76/0.9340 | 39.12/0.9783
B-GSCN 10 (KBS’21) [74] | ×2 | 1490 | 343 | 38.04/0.9606 | 33.64/0.9182 | 32.19/0.8999 | 32.19/0.9293 | 38.64/0.9771
DRSDN (KBS’21) [24] | ×2 | 1055 | 243.1 | 38.06/0.9607 | 33.65/0.9189 | 32.23/0.9003 | 32.40/0.9308 | -
FPNet (TCSVT’22) [75] | ×2 | 1615 | - | 38.13/0.9619 | 33.83/0.9198 | 32.29/0.9018 | 32.04/0.9278 | -
NGswin (CVPR’23) [73] | ×2 | 998 | 140.4 | 38.05/0.9610 | 33.79/0.9199 | 32.27/0.9008 | 32.53/0.9324 | 38.97/0.9777
LGUN (Ours) | ×2 | 675 | 141.1 | 38.24/0.9618 | 33.93/0.9208 | 32.34/0.9027 | 32.65/0.9322 | 39.38/0.9786
Bicubic | ×3 | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 | 26.95/0.8556
SRCNN (TPAMI’14) [15] | ×3 | 57 | 52.7 | 32.75/0.9090 | 29.30/0.8215 | 28.41/0.7863 | 26.24/0.7989 | 30.48/0.9117
VDSR (CVPR’16) [11] | ×3 | 665 | 612.6 | 33.67/0.9210 | 29.78/0.8320 | 28.83/0.7990 | 27.14/0.8290 | 32.01/0.9340
DRCN (CVPR’16) [65] | ×3 | 1774 | 9788.7 | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.14/0.8279 | 32.24/0.9343
MemNet (ICCV’17) [67] | ×3 | 677 | 623.9 | 34.09/0.9248 | 30.01/0.8350 | 28.96/0.8001 | 27.56/0.8376 | 32.51/0.9369
IDN (CVPR’18) [69] | ×3 | 553 | 57 | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 | 32.71/0.9381
CARN (ECCV’18) [70] | ×3 | 1592 | 118.8 | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | 33.50/0.9440
EDSR-baseline (CVPR’19) [12] | ×3 | 1555 | 160 | 34.37/0.9270 | 30.28/0.8417 | 29.09/0.8052 | 28.15/0.8527 | 33.45/0.9439
SRFBN-S (CVPR’19) [68] | ×3 | 375 | 686.4 | 34.20/0.9255 | 30.10/0.8372 | 28.96/0.8010 | 27.66/0.8415 | 33.02/0.9404
SMSR (CVPR’21) [71] | ×3 | 993 | 67.8 | 34.40/0.9270 | 30.33/0.8412 | 29.10/0.8050 | 28.25/0.8536 | 33.68/0.9445
A2N (arXiv’21) [72] | ×3 | 1036 | 117.5 | 34.47/0.9279 | 30.44/0.8437 | 29.14/0.8059 | 28.41/0.8570 | 33.78/0.9458
LMAN (TBC’21) [26] | ×3 | 1718 | 173.8 | 34.56/0.9286 | 30.46/0.8439 | 29.17/0.8067 | 28.47/0.8576 | 34.00/0.9470
SwinIR (ICCV’21) [32] | ×3 | 886 | 109.5 | 34.60/0.9289 | 30.54/0.8463 | 29.20/0.8082 | 28.66/0.8624 | 33.98/0.9478
B-GSCN 10 (KBS’21) [74] | ×3 | 1510 | 154 | 34.30/0.9271 | 30.35/0.8425 | 29.11/0.8035 | 28.20/0.8535 | 33.54/0.9445
DRSDN (KBS’21) [24] | ×3 | 1071 | 109.8 | 34.48/0.9282 | 30.41/0.8445 | 29.17/0.8072 | 28.45/0.8589 | -
FPNet (TCSVT’22) [75] | ×3 | 1615 | - | 34.48/0.9285 | 30.53/0.8454 | 29.20/0.8086 | 28.19/0.8534 | -
NGswin (CVPR’23) [73] | ×3 | 1007 | 66.6 | 34.52/0.9282 | 30.53/0.8456 | 29.19/0.8078 | 28.52/0.8603 | 33.89/0.9470
LGUN (Ours) | ×3 | 684 | 63.5 | 34.60/0.9292 | 30.54/0.8458 | 29.25/0.8102 | 28.53/0.8586 | 34.26/0.9480
Bicubic | ×4 | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 | 24.89/0.7866
SRCNN (TPAMI’14) [15] | ×4 | 57 | 52.7 | 30.48/0.8628 | 27.50/0.7513 | 26.90/0.7101 | 24.52/0.7221 | 27.58/0.8555
VDSR (CVPR’16) [11] | ×4 | 665 | 612.6 | 31.35/0.8830 | 28.02/0.7680 | 27.29/0.7260 | 25.18/0.7540 | 28.83/0.8870
DRCN (CVPR’16) [65] | ×4 | 1774 | 9788.7 | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.18/0.7524 | 28.93/0.8854
LapSRN (CVPR’17) [66] | ×4 | 813 | 149.4 | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7270 | 25.21/0.7560 | 29.09/0.8900
MemNet (ICCV’17) [67] | ×4 | 677 | 623.9 | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630 | 29.42/0.8942
IDN (CVPR’18) [69] | ×4 | 553 | 32.3 | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 | 29.41/0.8942
CARN (ECCV’18) [70] | ×4 | 1592 | 90.9 | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | 30.47/0.9084
EDSR-baseline (CVPR’19) [12] | ×4 | 1518 | 114 | 32.09/0.8938 | 28.58/0.7813 | 27.57/0.7357 | 26.04/0.7849 | 30.35/0.9067
SRFBN-S (CVPR’19) [68] | ×4 | 483 | 852.9 | 31.98/0.8923 | 28.45/0.7779 | 27.44/0.7313 | 25.71/0.7719 | 29.91/0.9008
SMSR (CVPR’21) [71] | ×4 | 1006 | 41.6 | 32.12/0.8932 | 28.55/0.7808 | 27.55/0.7351 | 26.11/0.7868 | 30.54/0.9085
A2N (arXiv’21) [72] | ×4 | 1047 | 72.4 | 32.30/0.8966 | 28.71/0.7842 | 27.61/0.7374 | 26.27/0.7920 | 30.67/0.9110
LMAN (TBC’21) [26] | ×4 | 1673 | 122.0 | 32.40/0.8974 | 28.72/0.7842 | 27.66/0.7388 | 26.36/0.7934 | 30.84/0.9129
SwinIR (ICCV’21) [32] | ×4 | 897 | 61.7 | 32.44/0.8976 | 28.77/0.7858 | 27.69/0.7406 | 26.47/0.7980 | 30.92/0.9151
B-GSCN 10 (KBS’21) [74] | ×4 | 1530 | 88 | 32.18/0.8950 | 28.60/0.7821 | 27.59/0.7364 | 26.12/0.7872 | 30.50/0.9080
DRSDN (KBS’21) [24] | ×4 | 1095 | 63.1 | 32.28/0.8962 | 28.64/0.7836 | 27.64/0.7388 | 26.30/0.7933 | -
FPNet (TCSVT’22) [75] | ×4 | 1615 | - | 32.32/0.8962 | 28.78/0.7856 | 27.66/0.7394 | 26.09/0.7850 | -
NGswin (CVPR’23) [73] | ×4 | 1019 | 36.4 | 32.33/0.8963 | 28.78/0.7859 | 27.66/0.7396 | 26.45/0.7963 | 30.80/0.9128
LGUN (Ours) | ×4 | 696 | 36.4 | 32.63/0.9008 | 28.94/0.7897 | 27.82/0.7458 | 26.88/0.8084 | 31.52/0.9183
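For reference, the PSNR values in Table 2 follow the standard definition PSNR = 10·log10(255²/MSE). The short sketch below scores one image pair under the common SISR convention of evaluating on the luminance (Y) channel with border cropping; these conventions, and the crop size, are assumptions here, as the exact evaluation protocol is not restated in this section.

```python
import numpy as np

def rgb_to_y(img):
    # ITU-R BT.601 luma, as commonly used in SISR evaluation; img is float RGB in [0, 255]
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr(sr, hr, crop=4):
    """PSNR in dB between two uint8 RGB images on the Y channel, with border cropping
    (cropping a margin equal to the scale factor is a common convention, assumed here)."""
    sr_y = rgb_to_y(sr.astype(np.float64))[crop:-crop, crop:-crop]
    hr_y = rgb_to_y(hr.astype(np.float64))[crop:-crop, crop:-crop]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
```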
Table 3. Ablation experiments on the micro-structure design. Bold indicates the best value in each group.
(a) Results for the MLHA and DGSA modules.
LGU | Set5 | Set14 | BSDS100 | Urban100 | Manga109 (PSNR/SSIM)
w/o MLHA | 38.19/0.9616 | 33.84/0.9199 | 32.28/0.9018 | 32.49/0.9307 | 39.31/0.9784
w/o DGSA | 38.15/0.9612 | 33.65/0.9180 | 32.25/0.9014 | 32.18/0.9284 | 39.11/0.9780
w MLHA + DGSA (Ours) | 38.24/0.9618 | 33.93/0.9208 | 32.34/0.9027 | 32.65/0.9322 | 39.38/0.9786
(b) Results for the STF strategy in MLHA.
MLHA | Set5 | Set14 | BSDS100 | Urban100 | Manga109 (PSNR/SSIM)
w/o STF | 38.20/0.9616 | 33.89/0.9200 | 32.30/0.9020 | 32.48/0.9309 | 39.28/0.9781
w STF (Ours) | 38.24/0.9618 | 33.93/0.9208 | 32.34/0.9027 | 32.65/0.9322 | 39.38/0.9786
(c) Results for the MTS strategy in DGSA.
DGSA | Set5 | Set14 | BSDS100 | Urban100 | Manga109 (PSNR/SSIM)
w/o top-k | 38.21/0.9615 | 33.87/0.9201 | 32.32/0.9024 | 32.56/0.9316 | 39.30/0.9785
w top-k | 38.22/0.9616 | 33.90/0.9203 | 32.32/0.9024 | 32.57/0.9317 | 39.34/0.9786
top-k with MTS (Ours) | 38.24/0.9618 | 33.93/0.9208 | 32.34/0.9027 | 32.65/0.9322 | 39.38/0.9786
(d) Results for the effectiveness of LKCS modules in MLHA.
MLHA | Set5 | Set14 | BSDS100 | Urban100 | Manga109 (PSNR/SSIM)
Identical LKCS | 38.11/0.9609 | 33.62/0.9175 | 32.19/0.9008 | 32.13/0.9277 | 39.05/0.9772
Different LKCS (Ours) | 38.24/0.9618 | 33.93/0.9208 | 32.34/0.9027 | 32.65/0.9322 | 39.38/0.9786