1. Introduction
Single image super-resolution (SR) is a low-level computer vision task that aims to reconstruct a high-resolution (HR) image from a corresponding low-resolution (LR) image. It is widely used in many applications, such as mobile devices, surveillance systems, autonomous driving, and medical imaging. However, SR is an ill-posed problem, since an identical LR image may be produced by degrading many different HR images. Efficiently reconstructing visually faithful HR images from degraded LR inputs therefore remains a challenging task.
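Formally, the degradation process is commonly modeled (a standard formulation, not specific to any one method) as
$$ I_{LR} = (I_{HR} \otimes k)\downarrow_{s} + n, $$
where $\otimes$ denotes convolution with a blur kernel $k$, $\downarrow_{s}$ denotes downsampling by the scale factor $s$, and $n$ is additive noise. Because many distinct HR images map to the same LR observation under this model, the inverse problem admits no unique solution.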
To address this issue, Dong et al. [1] proposed SRCNN, a three-layer convolutional neural network that marked the first application of deep learning to single image super-resolution and achieved significantly better results than traditional methods. Building on this, Kim et al. [2] introduced the VDSR model, incorporating residual learning to deepen the network to 20 layers while achieving rapid convergence. Lim et al. [3] presented the EDSR model, which simplifies the network structure by removing batch normalization (BN) layers, enhances the model's representational capacity, and won the NTIRE 2017 Super-Resolution Challenge. Zhang et al. [4] proposed a residual-in-residual structure, pushing the depth of the convolutional neural network (CNN) to 100–400 layers and yielding remarkably high PSNR values on benchmark datasets, surpassing previous methods.
While increasing the depth and width of CNNs can significantly enhance performance, it also incurs substantial computational cost and memory overhead, which restricts practical applications such as mobile devices, robots, and edge computing. To address this issue, it is essential to develop networks that are both lightweight and highly effective.
Several prior methods [5,6,7,8,9,10] have been proposed to achieve a better trade-off between SR performance and computational efficiency. However, these methods still suffer from issues such as small receptive fields, slow convergence, information loss, and network structures built entirely from single-scale convolution kernels.
Recent research indicates that efficient operator design is crucial for building efficient SR CNNs. Factorized convolution, as an efficient operator, can break down standard convolution operations into two smaller convolution operations, effectively reducing the computational complexity and the number of parameters while maintaining network performance. However, there has been no prior study on efficient SR algorithms based on factorized convolution. Therefore, in this paper, we adopt a factorized convolution approach and construct an efficient Cross-Scale Interaction Block (CSIB).
The design rationale of CSIB is as follows: First, we employ a convolution layer to decrease the number of parameters and expedite training, after which the features are split into two separate branches. To alleviate the computational burden, we factorize the standard depth-wise convolution into a pair of one-dimensional (horizontal and vertical) depth-wise convolutions. One branch uses factorized depth-wise convolution to extract local fine-grained features, while the other uses factorized depth-wise dilated convolution to capture global coarse-grained features. To prevent gridding artifacts, different CSIB modules employ varying dilation rates. Notably, for more effective integration of cross-scale contextual information, we perform interaction operations at the end of the dual-branch structure. This design not only reduces computational complexity but also adeptly merges local details and global features in the image. To further refine the aggregated contextual information, we designed an Efficient Large Convolutional Kernel Attention (ELKA) using large convolutional kernels and a gating mechanism.
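For illustration, a minimal PyTorch sketch of this dual-branch design is shown below. It assumes a 3×3 depth-wise kernel factorized into 1×3 and 3×1 parts and a simple gated interaction; the layer names and the exact interaction form are hypothetical simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FactorizedDWConv(nn.Module):
    """Depth-wise k x k convolution factorized into 1 x k and k x 1 parts."""
    def __init__(self, channels, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k // 2)
        self.h = nn.Conv2d(channels, channels, (1, k), padding=(0, pad),
                           dilation=(1, dilation), groups=channels)
        self.v = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0),
                           dilation=(dilation, 1), groups=channels)

    def forward(self, x):
        return self.v(self.h(x))

class CSIBSketch(nn.Module):
    """Illustrative dual-branch block: a local branch (dilation 1) and a
    global branch (dilated), followed by a cross-scale interaction."""
    def __init__(self, channels, dilation=3):
        super().__init__()
        self.entry = nn.Conv2d(channels, channels, 1)          # 1x1 entry conv
        self.local_branch = FactorizedDWConv(channels // 2, dilation=1)
        self.global_branch = FactorizedDWConv(channels // 2, dilation=dilation)
        self.fuse = nn.Conv2d(channels, channels, 1)           # fusion conv

    def forward(self, x):
        a, b = torch.chunk(self.entry(x), 2, dim=1)            # split branches
        local, glb = self.local_branch(a), self.global_branch(b)
        # cross-scale interaction: each branch modulates the other
        out = torch.cat([local * torch.sigmoid(glb),
                         glb * torch.sigmoid(local)], dim=1)
        return self.fuse(out) + x                              # residual
```

Varying the `dilation` argument across stacked blocks mirrors the strategy described above for avoiding gridding artifacts.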
We built a Cross-Scale Interaction Network, named “CSINet”, by stacking CSIBs to extract multi-scale contextual information. This design balances performance and computational efficiency, and CSINet outperforms most lightweight SISR methods in terms of both performance and computational complexity. Specific experimental results are shown in Figure 1. The key contributions of our work are summarized as follows:
We adopted a factorized convolution approach to design a Cross-Scale Interaction Block (CSIB). CSIBs employ a dual-branch structure to extract both local fine-grained features and global coarse-grained features. Furthermore, we utilize interaction operations at the end of the dual-branch structure, facilitating the integration of cross-scale contextual information;
We designed an Efficient Large Convolutional Kernel Attention (ELKA) with limited additional computation for refining and extracting features. Ablation studies validated the effectiveness of this attention module;
Comprehensive experiments on benchmark datasets show that our CSINet outperforms most state-of-the-art lightweight SR methods.
2. Related Work
2.1. Lightweight Image SR
To improve network speed while maintaining superior reconstruction results, several lightweight image super-resolution networks have been introduced [1,7,11,12,13,14]. These networks can be broadly categorized into three groups: network structure design, knowledge distillation, and pruning. Among the network structure design methods, FSRCNN [1] is the first lightweight super-resolution model. It performs upsampling at the end of the network, significantly improving processing speed, although its reconstruction quality leaves room for improvement. CARN [14] designs a cascaded residual module based on grouped convolution and adopts local and global cascading to fuse multi-layer features, thereby accelerating the model's running speed. PAN [8] designs self-calibrated blocks with pixel attention and upsampling blocks, achieving competitive performance with only 272K parameters.
Among knowledge distillation methods, IDN [15] uses 3×3 and 1×1 convolutions to construct an information distillation module, distilling the current feature map through channel separation and achieving real-time performance while maintaining reconstruction accuracy. Based on IDN, IMDN [11] introduces a multi-information distillation module that extracts a portion of useful features at each step and passes the remaining features to the distillation step of the next stage; after completion, the features extracted in each step are concatenated. Subsequently, RFDN [13] combines feature distillation connections and shallow residual blocks to construct a residual feature distillation block, achieving better performance than IMDN with fewer parameters.
Among pruning methods, SCCVLAB [16] uses a fine-grained channel pruning strategy for image super-resolution, achieving satisfactory results. SMSR [7] prunes redundant computations by learning spatial and channel masks, achieving better performance with improved inference efficiency.
Although the aforementioned methods are lightweight and efficient, the quality of SR reconstruction still requires significant improvement.
2.2. Attention Mechanism of Image SR
Researchers in the field of image super-resolution have adopted the attention mechanism, which was initially developed for natural language processing tasks [17,18] and has since proven effective in image super-resolution.
Hu et al. [18] proposed channel attention (CA), which assigns a weight to each feature channel based on its significance and improves feature representation by amplifying features with high weights and suppressing those with low weights. Hui et al. [15] enhanced the channel attention mechanism with contrast-aware channel attention (CCA), which assigns channel weights according to the sum of the standard deviation and the mean. Wang et al. [19] introduced efficient channel attention (ECA), which uses 1D convolution to efficiently capture dependencies across channels, making the attention mechanism lighter. These attention mechanisms exhibit state-of-the-art performance in SR tasks [4,8,15].
Some studies have introduced spatial attention to enrich the feature map. Wang et al. [20] proposed non-local attention, which captures global contextual information by computing pixel-to-pixel dependencies. Nevertheless, this mechanism incurs a substantial computational overhead. To address this issue, Liu et al. [13] proposed enhanced spatial attention (ESA), which reduces the channel dimensions with a 1×1 convolutional layer followed by a strided convolution to expand the receptive field; a max pooling operation with a large window and stride then focuses on the feature's crucial spatial information. EFDN [10] and BSRN [6] also demonstrate superior performance with ESA.
Guo et al. [21] proposed a novel linear attention mechanism named Large Kernel Attention (LKA), which utilizes the large receptive field of large convolutional kernels to achieve the adaptability and long-range correlations of self-attention. LKA has demonstrated excellent performance in various computer vision tasks [22,23]. However, the large convolutional kernels in LKA can introduce a significant computational burden. To address this, we decompose the large convolutional kernels in LKA into smaller ones, achieving results comparable to LKA while significantly reducing the computational requirements.
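To make the decomposition concrete, the sketch below first shows an LKA-style attention in the spirit of [21] (a small depth-wise convolution, a depth-wise dilated convolution, and a 1×1 convolution together emulating a large kernel), and then a simplified ELKA-like variant with factorized kernels and a gate. The kernel sizes and layer names are assumptions for illustration, not the exact configuration of our module.

```python
import torch
import torch.nn as nn

class LKASketch(nn.Module):
    """Large-kernel attention built from decomposed convolutions: a 5x5
    depth-wise conv plus a 7x7 depth-wise conv with dilation 3 emulate a
    large (roughly 21x21) receptive field at a fraction of the cost."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 5, padding=2, groups=c)
        self.dw_dil = nn.Conv2d(c, c, 7, padding=9, dilation=3, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return x * self.pw(self.dw_dil(self.dw(x)))  # multiplicative attention

class ELKASketch(nn.Module):
    """ELKA-like variant: the depth-wise kernels are further factorized into
    1D convolutions, and a gating branch modulates the attention map."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Conv2d(c, c, 1)                          # gating branch
        self.dw_h = nn.Conv2d(c, c, (1, 5), padding=(0, 2), groups=c)
        self.dw_v = nn.Conv2d(c, c, (5, 1), padding=(2, 0), groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        attn = self.pw(self.dw_v(self.dw_h(x)))
        return torch.sigmoid(self.gate(x)) * attn * x
```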
2.3. Factorized Convolution
Factorized convolution has emerged as a promising technique in efficient neural network design. It involves breaking down a standard convolution operation into multiple smaller convolution operations, typically aimed at reducing the computational complexity and model parameters. This technique has found widespread application in various computer vision tasks, including image classification, object detection, and semantic segmentation.
One common form of factorized convolution is depth-wise separable convolution, where a standard convolution layer is decomposed into two independent operations: a depth-wise convolution and a point-wise convolution. The depth-wise convolution filters each input channel independently in the spatial dimensions, while the point-wise convolution combines the filtered outputs across channels. This factorization significantly reduces the number of parameters, resulting in more efficient models.
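As a concrete illustration (a generic PyTorch sketch, not tied to any particular network), the parameter savings can be checked directly:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 64, 64, 3

# Standard convolution: c_in * c_out * k * k weights (+ bias)
standard = nn.Conv2d(c_in, c_out, k, padding=1)

# Depth-wise separable: depth-wise (c_in * k * k) + point-wise (c_in * c_out)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depth-wise
    nn.Conv2d(c_in, c_out, 1),                         # point-wise
)

print(n_params(standard))   # 36928 = 64*64*9 + 64
print(n_params(separable))  # 4800  = (64*9 + 64) + (64*64 + 64)
```

For a 3×3 kernel with 64 input and output channels, the separable version needs roughly one-eighth of the parameters, and the saving grows with channel count and kernel size.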
Recent research has demonstrated the immense potential of factorized convolution in enhancing the efficiency of neural networks. For instance, MobileNet [24] introduced depth-wise separable convolution, creating lightweight models suitable for mobile devices. ERFNet [25] factorized 3×3 convolutions into 3×1 and 1×3 convolutions, achieving substantial performance improvements in semantic segmentation tasks. Subsequent studies such as DABNet [26], LEDNet [27], and MSCFNet [28] have further improved upon this technique and successfully applied it to their respective tasks, underscoring the importance of factorized convolution in efficient network design.
While factorized convolution has been successful in tasks such as image classification and object detection, its potential for improving the efficiency of super-resolution neural networks remains largely unexplored.
To address this gap, we propose an innovative approach in this work, applying factorized convolution to super-resolution networks. Our method fully leverages the advantages of factorized convolution to create highly efficient and lightweight architectures capable of delivering high-quality image super-resolution results.
4. Experiments
4.1. Experiment Setup
4.1.1. Datasets and Metrics
Following previous research [5,7,10,11,12,13], we train our models on the popular DIV2K dataset [30], which contains 800 high-quality images. Five standard benchmark datasets are used for evaluation: Set5 [31], Set14 [32], BSD100 [33], Urban100 [34], and Manga109 [35]. To objectively assess performance, we convert images to the YCbCr color space and compute the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics on the luminance channel.
PSNR stands for peak signal-to-noise ratio and is a measure of image quality that compares the original image to the compressed or distorted image. It is defined as
$$ \mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX^{2}}{MSE}\right), $$
where $MAX$ is the maximum pixel value of the image, and $MSE$ is the mean squared error between the original and compressed/distorted images. Higher PSNR values indicate better image quality.
SSIM stands for structural similarity index and is a metric that compares the structural similarity of two images, taking luminance, contrast, and structure into account. It is defined as
$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, $$
where $x$ and $y$ are the two images being compared, $\mu_x$ and $\mu_y$ are their respective means, $\sigma_x^2$ and $\sigma_y^2$ are their respective variances, and $\sigma_{xy}$ is their covariance. $C_1$ and $C_2$ are constants used to avoid instability when the means are close to zero. The SSIM value ranges between −1 and 1, where a value of 1 indicates perfect similarity.
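A minimal sketch of this evaluation protocol is given below, assuming uint8 RGB inputs and using scikit-image's reference metric implementations; the BT.601 luminance conversion mirrors the MATLAB rgb2ycbcr convention commonly used in SR evaluation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def rgb_to_y(img):
    """BT.601 luminance channel of an RGB uint8 image (values in [16, 235])."""
    img = img.astype(np.float64) / 255.0
    return 16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]

def evaluate_pair(hr, sr):
    """PSNR/SSIM computed on the luminance (Y) channel only."""
    y_hr, y_sr = rgb_to_y(hr), rgb_to_y(sr)
    psnr = peak_signal_noise_ratio(y_hr, y_sr, data_range=255.0)
    ssim = structural_similarity(y_hr, y_sr, data_range=255.0)
    return psnr, ssim
```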
4.1.2. Training Details
During the training phase, LR training images are generated by downsampling the HR images with scaling factors of ×2, ×3, and ×4 using bicubic interpolation in MATLAB R2017a. We augment the training set with random horizontal and vertical flips and 90° rotations. In each mini-batch, randomly cropped LR color patches are used as inputs. The model is trained with the Adan optimizer. In the training stage, we minimize the L1 loss and halve the learning rate at fixed iteration milestones; in the fine-tuning stage, we switch to the L2 loss with a lower learning rate for additional iterations.
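The two-stage schedule can be sketched as follows. This is a simplified stand-in: the Adan optimizer comes from a third-party package, so Adam is substituted here, and the learning rates and milestones are placeholders rather than the exact values used in our experiments.

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; the real setup uses CSINet, DIV2K patches,
# and the Adan optimizer. Milestones and learning rates below are placeholders.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 3, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[200_000, 400_000], gamma=0.5)  # halve LR at milestones
l1_loss, l2_loss = nn.L1Loss(), nn.MSELoss()

def train_step(lr_patch, hr_patch, loss_fn):
    # Stage 1 passes l1_loss; the fine-tuning stage passes l2_loss instead.
    opt.zero_grad()
    loss = loss_fn(model(lr_patch), hr_patch)
    loss.backward()
    opt.step()
    sched.step()
    return loss.item()
```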
We also built a smaller variant of CSINet, named CSINet-S, by replacing a convolution in the FARG with a smaller one, and trained it on the DIV2K and Flickr2K datasets. During training, the mini-batch size is set to 64. CSINet-S follows the same optimization protocol: the Adan optimizer with the L1 loss and a halved learning-rate schedule, followed by fine-tuning with the L2 loss at a lower learning rate.
The proposed networks are implemented using the PyTorch framework and trained on a single NVIDIA 3090 GPU.
4.2. Ablation Study
4.2.1. Effectiveness of Dilation Rate
In deep learning-based image super-resolution methods, the receptive field size of the network is an important factor that affects the ability of the network to capture spatial information from the input image. The dilation rate is a common way to adjust the receptive field size of a CNN. A larger dilation rate means a larger receptive field, which can capture more global contextual information, while a smaller dilation rate means a smaller receptive field, which can capture more local details.
As shown in Table 1, we conducted extensive experiments to investigate the effect of different dilation rates on super-resolution performance. Specifically, we adopted the concept from [26,28] and tested seven different dilation configurations. Our experimental results demonstrate that the choice of dilation rate has a significant impact on the quality of the super-resolved images.
Among the tested dilation configurations, we found that setting the dilation rates to (1,3,3,5) consistently produces superior results across multiple benchmark datasets. These results are in line with previous studies that have also shown the effectiveness of large dilation rates in image super-resolution tasks.
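To make the receptive-field effect concrete, the sketch below stacks 3×3 depth-wise dilated convolutions with the best-performing (1, 3, 3, 5) rates; the names and channel count are illustrative only.

```python
import torch
import torch.nn as nn

def dw_dilated(c, d):
    # 3x3 depth-wise conv with dilation d; padding d preserves spatial size.
    # The effective kernel size is 2*d + 1, so d=5 covers an 11x11 window.
    return nn.Conv2d(c, c, 3, padding=d, dilation=d, groups=c)

# Four stacked layers with the (1, 3, 3, 5) dilation configuration; the
# combined receptive field grows to 25x25 while the parameter count stays tiny.
stack = nn.Sequential(*[dw_dilated(32, d) for d in (1, 3, 3, 5)])

x = torch.randn(1, 32, 48, 48)
print(stack(x).shape)  # torch.Size([1, 32, 48, 48]); spatial size preserved
```

Mixing small and large rates in this way also helps avoid the gridding artifacts that arise when every layer uses the same large dilation.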
4.2.2. Effectiveness of CSIB
The CSIB is intended to enhance the model's reconstruction performance by effectively fusing multi-scale features from different branches, which is achieved through its parallel branching and cross-fusion structures. To evaluate the effectiveness of the CSIB, two similar modules were designed for comparative analysis (Figure 6). The Multi-Branch Feature Fusion Block (MFFB) splits the input features into two branches using channel splitting and halving operations, and then extracts multi-scale contextual information from these two branches using depth-wise factorized convolution. The Cascade Dilated Fusion Block (CDFB) employs a cascade of three depth-wise dilated convolutions instead of the two-branch structure used in MFFB. Both modules were integrated into corresponding SR networks, named the Multi-Branch Feature Fusion Network (MFFNet) and the Cascaded Dilated Fusion Network (CDFNet), respectively. Extensive experiments were conducted to evaluate the performance of these three SR networks, with the results shown in Table 2.
In terms of reconstruction accuracy, these experiments clearly indicate that the CSIB is superior to both MFFB and CDFB. The CSIB achieved higher PSNR and SSIM values, indicating more accurate reconstructed images, and did so with fewer parameters and a lower computational cost, proving the efficacy of the interactive fusion structure in SR reconstruction. Compared to CDFB, CSIB not only requires fewer parameters but also demonstrates a significant advantage in reconstruction quality. This shows that the CSIB not only has a large receptive field but also effectively combines complementary information across scales to improve the model's representational capability.
The visual analysis of CDFB, MFFB, and CSIB is presented in Figure 7. As depicted in Figure 7a, CDFB shows promising results in recovering a portion of the butterfly's streak profile, albeit with some blurring. In contrast, Figure 7b shows that MFFB stands out due to its ability to extract more details of the stripes; this enhanced performance is attributed to its effective use of multi-scale feature extraction modules, which facilitate the recovery of intricate details with remarkable precision. Furthermore, the proposed CSIB, shown in Figure 7c, also utilizes multi-scale feature extraction modules, leading to superior restoration performance compared to the aforementioned models. CSIB excels in reconstructing high-frequency details and edge information with exceptional clarity, as evidenced by the results. These findings highlight the proficiency of CSIB in structural texture restoration and demonstrate the potential of deep learning models in image processing applications.
4.2.3. Effectiveness of Factorized Convolution
To validate the effectiveness of factorized convolution, we replaced it with regular convolution in CSIB, denoted as “w/RC”.
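The saving being tested here can be illustrated directly (a generic sketch counting the parameters of a single 64-channel depth-wise layer, not of the whole CSIB):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

c = 64
regular = nn.Conv2d(c, c, 3, padding=1, groups=c)        # 3x3 depth-wise conv
factorized = nn.Sequential(
    nn.Conv2d(c, c, (1, 3), padding=(0, 1), groups=c),   # 1x3 depth-wise conv
    nn.Conv2d(c, c, (3, 1), padding=(1, 0), groups=c),   # 3x1 depth-wise conv
)
print(n_params(regular))     # 640 = 64*9 + 64 bias terms
print(n_params(factorized))  # 512 = 2 * (64*3 + 64)
```

The gap widens for larger kernels: a k×k depth-wise layer costs O(k²) weights per channel, while the factorized pair costs only O(2k).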
The results in Table 3 show that factorized convolution reduces the parameter count by 14 K and the FLOPs by 0.8 G compared to regular convolution, while PSNR and SSIM improve across all benchmark datasets and the inference time decreases by 1.24 ms. These findings indicate that factorized convolution not only makes the model more lightweight but also contributes to significant performance improvements.
4.2.4. Effectiveness of ELKA and ESA
The ablation studies on the two attention modules, ELKA and ESA, are presented in Table 4. The results indicate that ELKA is a highly effective module: we observed a significant drop in network performance when ELKA was removed, with a decrease of approximately 0.2 dB on the Set5 and Set14 datasets and of over 0.4 dB on the Urban100 and Manga109 datasets. Furthermore, ESA has a positive impact on the model's performance, as evidenced by a substantial decrease in performance when ESA is removed.
These findings demonstrate that combining ELKA and ESA can effectively increase the model’s capacity. It is noteworthy that ELKA provides a more computationally efficient way to incorporate global information, while ESA modules can enhance the local feature representation. Thus, the combination of these attention modules offers a well-balanced and effective solution to improve the model’s performance.
To further observe the benefits produced by our ELKA module, we visualize the feature maps before and after ELKA for different FARGs, as shown in Figure 8. It can be observed that the ELKA module enhances high-frequency information, making the edges and structural textures in the output features clearer.
4.3. Comparison with the SOTA SR Methods
To verify the effectiveness of the proposed model, we compare our CSINet model with 14 lightweight state-of-the-art SISR methods, including SRCNN [1], VDSR [2], CARN [14], IDN [15], MAFFSRN [36], SMSR [7], IMDN [11], PAN [8], LAPAR-A [12], RFDN [13], Cross-SRN [37], FDIWN [38], RLFN [5], and BSRN [6]. The results of the comparisons are presented in Table 5. To assess model size, we used two metrics: the number of parameters and the number of operations (Multi-Adds), the latter calculated on a 1280 × 720 high-resolution image. Our method achieved outstanding results on all the datasets at various scaling factors, outperforming most of the other state-of-the-art networks in both PSNR and SSIM. Despite having fewer parameters and Multi-Adds, our CSINet outperformed methods such as LAPAR-A, RFDN, Cross-SRN, and even RLFN, which won second place in sub-track 2 (Overall Performance Track) of the NTIRE 2022 Efficient Super-Resolution Challenge. These results illustrate the effective balance our method achieves between image quality and computational efficiency.
We also incorporated the Non-Reference Image Quality Evaluator (NIQE) into our evaluation metrics to provide a more comprehensive analysis, comparing our model against other lightweight models, including VDSR, CARN, IMDN, PAN, EFDN, and RLFN, as shown in Table 6. The NIQE score measures the naturalness of an image, with lower scores indicating better image quality. Our model achieved comparable or slightly lower NIQE scores than these models, indicating that it produces images with similar or slightly better naturalness. These results suggest that our model not only performs competitively on traditional metrics such as PSNR and SSIM but also maintains or enhances the perceptual quality of the super-resolved images, demonstrating the effectiveness of our lightweight model in preserving image quality while reducing computational complexity.
Visual comparisons of our method with several state-of-the-art methods are presented in Figure 9, Figure 10, and Figure 11. The results demonstrate the superiority of our method in terms of image quality.
For Set14, we compared the models' ability to reconstruct the “baboon” and “monarch” images. While the SRCNN [1] and VDSR [2] models recovered most of the stripe contours, their reconstructions still exhibited blurriness. In contrast, our proposed CSINet reconstructed high-frequency details with greater clarity; for the “monarch” image, CSINet was also superior in reproducing the butterfly antennae.
On the BSD100 dataset, we evaluated the models on the “108005” and “148026” images. Bicubic interpolation failed to reproduce the basic texture when reconstructing the stripes on the tiger. While other models, such as CARN [14], IMDN [11], PAN [8], and EFDN [10], recovered more stripe details, their reconstructed images still exhibited some blurriness. In contrast, CSINet reconstructed high-frequency details with greater clarity, outperforming all the other models. For the “148026” image, CSINet likewise produced reconstructions with clear texture and rich details that were closer to the real images than those of the other models.
Finally, on the Urban100 dataset, we evaluated the models' ability to restore the “img_092” image. Most models, except Bicubic, could restore the horizontal stripes of the building facade but still exhibited some blurriness, whereas the reconstructions from CSINet had clear texture and rich details. Similarly, for the “img_062” image in the Urban100 test set, the reconstructions using Bicubic, SRCNN [1], and VDSR [2] were severely distorted and blurry. While the results of CARN [14], IMDN [11], PAN [8], EFDN [10], and E-RFDN [13] were slightly clearer, the glass window grids remained distorted and deformed. In contrast, the reconstructions produced by our CSINet had clear texture and rich details, closer to the real images.
Overall, our subjective visual effect comparisons demonstrate that CSINet outperforms other state-of-the-art super-resolution models, providing high-frequency details that are clearer and closer to the real images.
4.4. Complexity Analysis
The runtime of a network is a crucial metric, even for lightweight SR algorithms. We conducted comparative experiments on the Set5 dataset to assess the reconstruction speeds of mainstream networks. The experiments were run on an NVIDIA 3090 GPU with 24 GB of memory. After 10 repeated runs, the average inference times were obtained and are presented in Figure 12. It can be observed that our CSINet not only achieves the fastest reconstruction speed but also delivers the best reconstruction quality, demonstrating the significant advantages of our lightweight CSINet.
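A typical GPU timing protocol matching this setup is sketched below (the model and input are stand-ins; `torch.cuda.synchronize` is needed because CUDA kernels launch asynchronously):

```python
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1).cuda().eval()  # stand-in for the SR model
x = torch.randn(1, 3, 256, 256, device="cuda")       # stand-in test image

with torch.no_grad():
    for _ in range(5):                                # warm-up runs
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    runs = 10                                         # average over 10 runs
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()                          # flush queued GPU work
    print((time.time() - start) / runs * 1e3, "ms per image")
```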
To further validate the lightweight nature of CSINet, we deployed it on the NVIDIA Jetson Xavier NX Developer Kit, known as one of the world's smallest AI supercomputers for embedded MEC systems. We conducted experiments on real-world photos, for which ground-truth images and downsampling kernels are unavailable, to evaluate the effectiveness of CSINet in the embedded MEC system. As depicted in Figure 13, our method reconstructs sharper and more accurate images than state-of-the-art approaches, indicating that our lightweight model achieves excellent super-resolution performance and is highly suitable for deployment in embedded MEC systems.
4.5. Discussions
The effectiveness of the proposed Cross-Scale Interaction Block (CSIB) is a key highlight of our study. CSIB stands out as a crucial component in enhancing the overall performance of CSINet.
Firstly, CSIB is meticulously designed for single image super-resolution (SISR), integrating cross-scale contextual information using depth-wise convolution and dilated convolution. This design choice proves effective in capturing and leveraging contextual details across different scales, contributing to improved image reconstruction.
Secondly, the incorporation of the Efficient Large Convolutional Kernel Attention (ELKA) within CSIB further enhances the model's representational capacity. ELKA plays a pivotal role in aggregating relevant features efficiently, contributing to the model's ability to capture intricate details and patterns.
The experimental results underscore the effectiveness of CSIB. Compared to using regular convolution, the inclusion of factorized convolution within CSIB leads to significant reductions in parameters and FLOPs while simultaneously improving PSNR and SSIM and reducing the inference time. This indicates that CSIB not only reduces model complexity but also positively impacts image quality and computational efficiency.
In visual comparisons with state-of-the-art methods, CSINet equipped with CSIB excels in reconstructing high-frequency details with exceptional clarity. This suggests that the designed cross-scale interaction mechanism within CSIB plays a pivotal role in capturing and utilizing contextual information effectively, resulting in superior image reconstruction.
CSIB emerges as a crucial element contributing to the effectiveness of CSINet. Its innovative design and integration within the network significantly improve image quality, demonstrating the efficacy of the proposed cross-scale interaction strategy in the context of lightweight super-resolution.
5. Conclusions
In this paper, we introduced the Cross-Scale Interaction Network (CSINet), a novel architecture designed for lightweight single image super-resolution (SISR). Specifically, we presented a lightweight Cross-Scale Interaction Block (CSIB) tailored for SISR. This block is carefully crafted to integrate cross-scale contextual information using depth-wise convolution and dilated convolution, leading to an effective reduction in model complexity. Additionally, the integration of the Efficient Large Convolutional Kernel Attention (ELKA) enhances the model's representational capacity. The proposed network is characterized by its lightweight nature, with only 366K parameters. Extensive experiments on benchmark datasets validate that CSINet outperforms the majority of state-of-the-art lightweight SR methods, achieving superior results with fewer parameters and Multi-Adds and underscoring its efficiency and effectiveness.
In future work, we will focus on further optimizing the model parameters and inference time to make CSINet even more lightweight and better suited to real-time scenarios, with the goal of ensuring its seamless integration into real-time application environments.