Article

BSRT++: Improving BSRT with Feature Enhancement, Weighted Fusion, and Cyclic Sampling

1 Division of Electronics and Communications Engineering, Pukyong National University, 45 Yongso-ro, Nam-gu, Busan 48513, Republic of Korea
2 Department of Artificial Intelligence Convergence, Graduate School, Pukyong National University, 45 Yongso-ro, Nam-gu, Busan 48513, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3178; https://doi.org/10.3390/electronics13163178
Submission received: 12 July 2024 / Revised: 9 August 2024 / Accepted: 10 August 2024 / Published: 11 August 2024
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

Abstract

Multi-frame super-resolution (MFSR) generates a super-resolution (SR) image from a burst consisting of multiple low-resolution images. Burst Super-Resolution Transformer (BSRT) is a state-of-the-art deep learning model for MFSR. However, in this study, we show that there is room for further improvement of BSRT in its feature extraction and fusion process. We then propose a feature enhancement module (FEM), a cyclic sampling module (CSM), and a feature reweighting module (FRM) and integrate them into BSRT. Finally, we demonstrate that the modules help recover high-frequency information, enhance inter-frame communication, and suppress misaligned features, thus significantly improving SR performance and producing more visually plausible and pleasant results than other MFSR methods, including BSRT. On the SyntheticBurst and RealBurst datasets, the improved BSRT with the modules, dubbed BSRT++, achieved PSNR values 1.15 dB and 1.31 dB higher than BSRT, respectively.

1. Introduction

Super-resolution (SR) involves the enhancement of a low-resolution (LR) image by increasing its resolution to reveal the underlying high-resolution (HR) image. This process finds applications in diverse fields such as surveillance, forensics, microscopy, and remote sensing [1]. However, SR is inherently ill-posed because multiple different HR images can give rise to the same LR image.
The approach to SR is divided into single image super-resolution (SISR) and multi-frame super-resolution (MFSR) depending on the number of LR images used as input. MFSR approaches can aggregate subpixel information from multiple frames or shifted images of the same scene, alleviating the ill-posed nature of SR and generating a higher-quality image than SISR approaches. However, MFSR imposes new challenges, such as aligning the input images (captured from different viewpoints) with subpixel accuracy and fusing the aligned images. Misaligned pixels act as noise during the fusion process and consequently become a major factor in deteriorating SR performance. Even if the input images are well aligned, the quality of the SR image varies depending on the fusion method because each image contributes a different amount and quality of information to the final SR image.
As a representative deep learning approach to MFSR, DBSR [2] introduced an innovative attention-based fusion module to address noisy RAW snapshots captured with handheld cameras. It utilized dense pixel-wise optical flow estimation to explicitly align the deep features of the input images. The aligned representations were subsequently merged by computing element-wise fusion weights, which allowed reliable and informative content to be adaptively selected from each image. Similarly, BSRT [3] introduced a pyramid flow-guided deformable convolution network (Pyramid FG-DCN), which allows for a more accurate and flexible alignment of features through pyramid flow estimation and deformable convolution. In addition, it adopted the Swin Transformer [4] to capture both global and local contexts of images, significantly improving SR performance and achieving state-of-the-art results. Several MFSR methods have since been proposed, but BSRT remains one of the best.
However, we found some important points that are overlooked in the feature extraction and fusion of BSRT. First, the input images (or their features) carry different high-frequency information, and this high-frequency information can contribute to making the final SR image rich in detail; however, during the BSRT fusion process, this high-frequency information is weakened. Second, the fusion layer using a 1 × 1 convolution is not effective for inter-frame communication (especially between temporally spaced images), which may degrade SR performance. Third, the features extracted from each input image are of different importance, but BSRT fuses all the features with the same weight; in addition, BSRT does not take misaligned features into account. To address these issues, we propose three modules and integrate them into BSRT. The modified BSRT is dubbed BSRT++ hereafter. First, a feature enhancement module (FEM) further enhances the high-frequency information of each feature before the fusion process, helping to generate more detailed SR images. Second, a cyclic sampling module (CSM) cyclically samples neighboring frames and then hierarchically fuses their features so that inter-frame communication is significantly enhanced. Third, a feature reweighting module (FRM) calculates the importance of features to be used as weights in the fusion process, which helps suppress misaligned features.
Note that most MFSR methods use LR-HR paired datasets for supervision; however, it is difficult to obtain paired datasets in real environments, and synthetically generated datasets cannot completely simulate real-world conditions such as noise and blur. For this reason, self-supervised methods [5] that do not require paired datasets have been proposed. However, in this study, assuming that paired datasets are given, we focus on developing an improved supervised MFSR method.

2. Related Work

2.1. Single Image Super-Resolution

SISR is a technology that generates an HR image from a single LR image. Traditionally, local patch-based SR techniques using linear mapping have been widely studied to produce high-quality HR images with relatively low complexity and computational cost. However, these techniques have limited performance due to the nonlinear relationship between LR and HR images.
Recently, research on deep learning-based SR methods has been actively conducted, showing higher performance than traditional methods. SRCNN [6], the first network to apply deep learning to the SISR problem, is a three-layer fully convolutional network that improves the quality of images enlarged by bicubic interpolation, demonstrating that deep learning can significantly improve SR performance. Since then, various deep learning-based methods have been proposed to deal with the SISR problem. VDSR [7] introduced a very deep convolutional network and showed that increasing network depth via residual learning can significantly improve SR performance, and ESPCN [8] proposed an efficient subpixel convolutional layer that converts image upsampling into pixel shuffling. EDSR [9] is designed to learn more stably by dividing the deep network into residual blocks connected by skip connections; in addition, by excluding batch normalization, it reduces blurring and memory consumption. RCAN [10] also bypasses abundant low-frequency information through various skip connections, allowing the network to focus on learning high-frequency information, and proposes a channel attention mechanism that rescales channel-wise features by modeling inter-channel dependencies, further improving SR image quality. Since then, Transformers have begun to be applied to SISR, with SwinIR [4] as a pioneering example. Unlike previous convolutional neural network (CNN)-based models, Transformer-based models can effectively model long-range dependencies between pixels, but because a lot of training data and GPU memory are required to train such models, research on efficient models is being actively conducted. To this end, ESRT [11], a hybrid model combining a CNN and a Transformer, was proposed, alleviating the high computational burden of Transformers and showing the high potential of Transformers in SISR. LBNet [12] is a lightweight model that combines a symmetric CNN for local feature extraction and a recursive Transformer for learning the global context of images, showing advantages in accuracy, model size, execution time, and GPU memory consumption. Most recently, diffusion models [13] have shown immense promise in SR by generating SR images that are even more aesthetically pleasing and realistic. However, diffusion models also come with challenges such as high computational complexity and color shift, and a number of follow-up studies are underway to address them.
In summary, despite advances in CNN-, Transformer-, and diffusion-based models for SISR, developing SISR methods with both high accuracy and computational efficiency remains a challenge due to the limited information in a single input image and the high complexity of deep learning models.

2.2. Multi-Frame Super-Resolution

MFSR is a technology that generates HR images from multiple LR images (frames), and it is possible to generate higher quality images by extracting information contained in multiple LR images and complementarily fusing them. However, in order to fuse multiple image information, the process of accurately aligning the images is essential, and the aligned features must be effectively fused. Therefore, the problem of alignment and fusion is considered the most important issue in MFSR.
The MFSR problem was first addressed by modeling the relationship between input images in the frequency domain [14]. However, visual artifacts occurred when multiple frames were processed in the frequency domain. Afterwards, an iterative backprojection algorithm [15] was proposed that refines the SR image by repeatedly minimizing the reconstruction error between the input LR images and the LR images obtained by downsampling the current SR estimate, and an approach [16] that minimizes alignment errors between LR images through region-based matching was proposed as well.
Recently, deep learning-based MFSR methods have been proposed to take advantage of deep learning technology. For example, TDAN [17] introduced a deformable convolution network (DCN) to solve the problem of alignment between images. DCN extends the standard convolution operation to extract features that account for local deformations and shows excellent performance when there is motion or shape deformation within a region [18]. TDAN used this DCN to perform flexible alignment between images. EDVR [19] proposed a pyramid, cascading, and deformable convolution (PCD) module that can also handle the alignment of images with relatively large movements.
The image (or feature) alignment methods for MFSR are largely divided into implicit alignment and explicit alignment. Implicit alignment is a method to indirectly learn the relationship between images in the learning process without directly performing alignment between input images, whereas explicit alignment pre-aligns the input images for use in subsequent SR processes. In the case of implicit alignment, the learning process is relatively simple because prior knowledge of alignment is not required, but the MFSR method that uses implicit alignment has the disadvantage of being sensitive to unexpected changes or noise between images because the relationship between frames must be learned by itself. In contrast, explicit alignment requires pre-alignment but allows the MFSR method to use explicitly aligned images in SR processes, resulting in more stable and reliable SR results.
BIPNet [20] introduced an implicit alignment module based on deformable convolution and a pseudo-feature fusion module to enable flexible information exchange between images. AFCNet [21], based on BIPNet, improved the feature extraction process by integrating Restormer [22], which is a Transformer-based image restoration model. Burstormer [23] extracted features with reduced noise by applying a transformer-based attention module prior to alignment in the AFCNet framework. In addition, in the feature fusion process, neighboring frames were cyclically sampled and then hierarchically fused through a Transformer-based attention module so that inter-frame communication was not limited to adjacent frames.
DBSR [2] showed excellent MFSR results by combining an explicit alignment module using PWC-Net [24] with an attention-based fusion module. However, DBSR may be effective when the movement between LR images is small but not when the movement is relatively large. EBSR [25] proposed an improved PCD module, which allows for a more accurate and flexible alignment of images with large movements through deformable convolution operations. In addition, a non-local fusion module was proposed that effectively fuses information extracted from different areas within each image to further improve SR performance. As a Transformer-based MFSR model that uses explicit alignment, BSRT introduced the Pyramid FG-DCN that uses SpyNet [26] to obtain pyramid flows between multiple frames and DCNs [18] to more accurately align multi-scale features. BSRT further improved SR performance by capturing both global and local contexts for long-distance dependency modeling using Swin Transformer as the backbone. BSRT is described in more detail in the next section. FBAnet [27] warped each frame using the homography obtained by taking the first frame as a reference and fused the aligned deep features using weights computed from the differences in their affinity maps. FBAnet fed the fused features to two hourglass Transformer blocks to model long-range pixel dependencies.

2.3. BSRT

Given multiple noisy RAW images as input, BSRT aims to generate a denoised, demosaicked, and super-resolved RGB image as output. The process pipeline of BSRT is shown in Figure 1. The input images $x_k$ ($k = 1, \ldots, N$) are N 4-channel RAW images. First, they are 2× upscaled to 1-channel 'RGGB' images by pixel shuffling and converted to 3-channel RGB images by a 3 × 3 convolution. Letting one of them be the reference frame, the optical flow between the reference frame and each frame is obtained by a pre-trained SpyNet. Meanwhile, the original RAW images are sent to several Swin Transformer blocks to extract features. Then, the feature maps are 2× upscaled by pixel shuffling as well. The feature maps are aligned to the reference frame's feature map using the Pyramid FG-DCN alignment module with the preacquired optical flows. After that, these aligned feature maps $y_k$ are fused by a 1 × 1 convolution and passed through several Swin Transformer groups to reconstruct the HR image.
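To make the input handling above concrete, the following PyTorch sketch (our own simplified stub, not BSRT's released code) shows how a burst of 4-channel packed RAW frames could be 2× upscaled by pixel shuffling and mapped to rough 3-channel RGB frames before flow estimation; the module name, tensor shapes, and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RawToRgbStub(nn.Module):
    """Hypothetical sketch of BSRT's input preprocessing: pixel-shuffle the
    4-channel packed RAW frames (2x spatial upscale) and convert the result
    to 3-channel RGB with a 3x3 convolution for optical flow estimation."""
    def __init__(self):
        super().__init__()
        self.shuffle = nn.PixelShuffle(upscale_factor=2)       # 4 channels -> 1 channel, 2x size
        self.to_rgb = nn.Conv2d(1, 3, kernel_size=3, padding=1)

    def forward(self, burst):                                   # burst: (B, N, 4, H, W) packed RAW
        b, n, c, h, w = burst.shape
        x = burst.view(b * n, c, h, w)
        x = self.shuffle(x)                                     # (B*N, 1, 2H, 2W) 'RGGB' mosaic
        x = self.to_rgb(x)                                      # (B*N, 3, 2H, 2W) rough RGB frames
        return x.view(b, n, 3, 2 * h, 2 * w)

# Example: a burst of N = 14 RAW frames of size 48 x 48
rgb = RawToRgbStub()(torch.randn(1, 14, 4, 48, 48))
print(rgb.shape)  # torch.Size([1, 14, 3, 96, 96])
```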
The main components of BSRT are the Pyramid FG-DCN, a pyramid flow-based alignment module, and Swin Transformer-based feature extraction and reconstruction modules. Pyramid FG-DCN can predict fine distortions and offsets in the alignment process, allowing a more precise alignment of feature maps, and the pyramid structure improves alignment accuracy by considering feature maps of various resolutions and sizes. In the feature extraction and image reconstruction process, Swin Transformer blocks and groups can capture long-range dependencies to aggregate correlated high-frequency information, enabling the extraction of useful features and the reconstruction of high-quality images. Owing to these advantages, BSRT won first place on the real-world track of the NTIRE 2022 Burst Super-Resolution Challenge, and SR results on synthetic and real-world datasets demonstrate that BSRT achieves state-of-the-art performance and produces visually plausible and pleasant results.
However, BSRT overlooks the following points, which limit its performance, so we aim to further improve it in this study.
- The input images contain complementary high-frequency information, which must be fully reconstructed (the solution will be given in Section 3.1).
- In MFSR, flexible information exchange between long-range images (frames) should be achieved [20,23] (the solution will be given in Section 3.2).
- The features extracted from each input image are of different importance, and they cannot be completely aligned by the alignment process (the solution will be given in Section 3.3).

3. BSRT++

BSRT++ is an improvement to BSRT that produces high-quality SR images from a burst consisting of multiple LR RAW images. Compared to BSRT, BSRT++ has three additional modules: FEM, CSM, and FRM. The process pipeline of BSRT++ is shown in Figure 2.

3.1. Feature Enhancement Module

Each input image contains complementary high-frequency information that plays an important role in enhancing the clarity and detail of the SR image. To ensure that high-frequency information is well recovered, FEM extracts and enhances high-frequency information from feature maps aligned by the Pyramid FG-DCN alignment module.
Inspired by [20], the aligned feature maps $y_k$ are enhanced as
$\tilde{y}_k = y_k + \mathrm{Conv}(y_k - \bar{y}), \quad k = 1, \ldots, N.$
Here, $\tilde{y}_k$ is the enhanced feature map, and N is the number of input frames. $\bar{y}$ is the feature map of the reference frame. The residual feature map between each frame and the reference frame is computed, enhanced by a convolution operation, and finally added to the feature map of each frame. This enhances the detailed texture information unique to each frame, allowing it to be better reconstructed in the SR results. On the other hand, FEM may also boost the noise or misalignment errors of the feature maps. However, we assume that the feature maps are free of noise and misalignment at this stage because they have already been refined and aligned by the Swin Transformers and the Pyramid FG-DCN alignment module. Residual noise and misalignment errors are addressed by the FRM (Section 3.3).
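A minimal PyTorch sketch of the FEM equation above is given below; the channel count and the bias-free 3 × 3 convolution are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureEnhancementModule(nn.Module):
    """Sketch of the FEM: enhance each aligned feature map by adding a convolved
    residual against the reference frame's feature map. bias=False keeps the
    reference frame itself unchanged (Conv of a zero residual stays zero)."""
    def __init__(self, channels=64):                       # channel count is illustrative
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, y, ref_index=0):
        # y: (B, N, C, H, W) aligned feature maps; y[:, ref_index] is the reference map
        y_ref = y[:, ref_index : ref_index + 1]             # (B, 1, C, H, W)
        residual = y - y_ref                                # per-frame residual vs. reference
        b, n, c, h, w = residual.shape
        enhanced = self.conv(residual.view(b * n, c, h, w)).view(b, n, c, h, w)
        return y + enhanced                                 # y_tilde_k = y_k + Conv(y_k - y_bar)
```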

3.2. Cyclic Sampling Module

Unlike BSRT, which simply concatenates the feature maps before the 1 × 1 convolution for fusion, BSRT++ rearranges the feature maps and concatenates them hierarchically. Inspired by [23], neighboring feature maps are sampled in duplicate, pairwise, and cyclically, and the pairwise concatenations are then concatenated again, as shown in Figure 2.
$f = \mathrm{Concat}(c_1, \ldots, c_{N/2}, d_1, \ldots, d_{N/2}),$
where
$c_i = \mathrm{Concat}(W(\tilde{y}_{2i-1}, \tilde{y}_{2i})), \quad d_i = \mathrm{Concat}(W(\tilde{y}_{2i}, \tilde{y}_{(2i+1) \bmod N})), \quad i = 1, \ldots, N/2.$
Here, $\mathrm{Concat}$ is the concatenation operation along the channel dimension ($dim = 1$), and $W$ denotes the FRM, which will be explained in Section 3.3. The final feature map $f$ is sent to the 1 × 1 convolution for fusion. This sampling module facilitates information exchange between neighboring frames and, at the same time, enables effective interaction between distant frames, allowing long-range context to be used in the image reconstruction process. Furthermore, it is significantly more efficient than the approach of using a pseudo-feature set produced by channel-wise concatenation of all feature maps [20].
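The following sketch illustrates the cyclic pairing and hierarchical concatenation defined above; the function name, the `reweight` callable (standing in for the FRM $W$ with the reference map already bound into it), and the even-burst-size assumption are ours.

```python
import torch

def cyclic_sampling(y_tilde, reweight):
    """Sketch of the CSM: pair the N aligned (enhanced) feature maps cyclically,
    reweight each pair with the FRM (`reweight`), and concatenate all pairwise
    results along the channel dimension before the 1x1 fusion convolution."""
    b, n, c, h, w = y_tilde.shape
    assert n % 2 == 0, "assumes an even burst size (N = 14 in the paper)"
    pairs = []
    # c_i: pairs (1,2), (3,4), ..., (N-1,N) in 1-based indexing
    for i in range(0, n, 2):
        a, bb = reweight(y_tilde[:, i], y_tilde[:, i + 1])
        pairs.append(torch.cat([a, bb], dim=1))
    # d_i: pairs (2,3), (4,5), ..., (N,1) -- cyclic wrap-around to the first frame
    for i in range(1, n, 2):
        j = (i + 1) % n
        a, bb = reweight(y_tilde[:, i], y_tilde[:, j])
        pairs.append(torch.cat([a, bb], dim=1))
    return torch.cat(pairs, dim=1)      # f, fed to the 1x1 fusion convolution
```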

3.3. Feature Reweighting Module

In BSRT, the feature maps are equally weighted in the fusion process. However, in our BSRT++, feature maps are reweighted based on their similarity to the reference feature map. Furthermore, since feature maps are paired in the CSM and then fused, the reweighting process W is performed in pairs. That is,
$\hat{y}_{l/m}, \hat{y}_{m/l} = W(\tilde{y}_l, \tilde{y}_m).$
Here, $\hat{y}_{l/m}$ and $\hat{y}_{m/l}$ represent the reweighted feature maps and are computed as
$\hat{y}_{l/m} = \dfrac{\psi_m}{\psi_l + \psi_m}\,\tilde{y}_l, \quad \hat{y}_{m/l} = \dfrac{\psi_l}{\psi_l + \psi_m}\,\tilde{y}_m,$
where
$\psi_l = \sum_j \left| \tilde{y}_l(j) - \bar{y}(j) \right|, \quad \psi_m = \sum_j \left| \tilde{y}_m(j) - \bar{y}(j) \right|.$
Here, $\psi_l$ is the sum of the pixel-wise differences (errors) between the l-th feature map and the reference feature map, and $y(j)$ indicates the j-th pixel value of $y$. The larger the error sum of a feature map, the more likely it is misaligned or noisy, so it is given a relatively small weight. This reweighting process helps to reconstruct high-quality images by suppressing alignment errors and noise while ensuring that the details present in the reference frame are fully restored. Furthermore, the pairwise process helps prevent misalignment errors or noise in one feature map from propagating to distant feature maps.
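A minimal sketch of this pairwise reweighting is shown below; the absolute-error form of $\psi$ and the small epsilon added to the denominator are our assumptions for numerical safety.

```python
import torch

def feature_reweighting(y_l, y_m, y_ref, eps=1e-6):
    """Sketch of the FRM: weight each feature map of a pair by the (normalized)
    error of the *other* map against the reference feature map, so that less
    reliable (misaligned or noisy) maps receive smaller weights."""
    psi_l = (y_l - y_ref).abs().flatten(1).sum(dim=1)      # per-sample error sum of y_l
    psi_m = (y_m - y_ref).abs().flatten(1).sum(dim=1)      # per-sample error sum of y_m
    denom = psi_l + psi_m + eps
    w_l = (psi_m / denom).view(-1, 1, 1, 1)                # larger psi_l -> smaller weight for y_l
    w_m = (psi_l / denom).view(-1, 1, 1, 1)
    return w_l * y_l, w_m * y_m
```

With `functools.partial(feature_reweighting, y_ref=reference_features)`, this function can be bound to the reference feature map and passed as the `reweight` callable in the CSM sketch above.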

4. Experimental Results and Discussion

To verify the performance of the proposed method, we conducted experiments super-resolving input bursts with an upscale factor of 4 on the SyntheticBurst and RealBurst datasets provided by the NTIRE 2022 Burst Super-Resolution Challenge [28]. The SyntheticBurst dataset consists of 46,839 sRGB images of size 448 × 448 that are used to synthesize RAW burst sequences for training and evaluation. Every sequence comprises 14 LR RAW images, each with a spatial resolution of p × p pixels (p = 48 for evaluation), generated synthetically from a single sRGB image in the following manner. First, the sRGB image is transformed into RAW space using the inverse camera pipeline [29]. Random rotations and translations are then applied to this RAW image to create an HR burst sequence. The HR burst is subsequently downsampled into an LR RAW burst sequence by applying Bayer mosaicking and adding random noise. The RealBurst dataset consists of 200 RAW burst sequences, each comprising 14 LR RAW images. The LR RAW images were taken with a Samsung smartphone camera, and the HR images were taken with a DSLR camera. From these sequences, 882 patches of size 80 × 80 were extracted for evaluation.
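As a rough illustration of this burst-synthesis procedure, the heavily simplified PyTorch sketch below translates an image to form a burst, downsamples it, packs an RGGB Bayer mosaic, and adds noise. The function name, parameter defaults, and the omission of the RAW unprocessing [29] and rotation steps are our simplifications, not the challenge's reference implementation.

```python
import random
import torch
import torch.nn.functional as F

def synthesize_lr_burst(rgb, burst_size=14, scale=4, max_shift=4.0, noise_std=0.01):
    """Simplified sketch of synthetic LR RAW burst generation from one sRGB image."""
    _, h, w = rgb.shape                                    # rgb: (3, H, W), values in [0, 1]
    frames = []
    for _ in range(burst_size):
        # random subpixel translation (normalized coordinates for affine_grid)
        dx = random.uniform(-max_shift, max_shift) * 2.0 / w
        dy = random.uniform(-max_shift, max_shift) * 2.0 / h
        theta = torch.tensor([[[1.0, 0.0, dx], [0.0, 1.0, dy]]])
        grid = F.affine_grid(theta, [1, 3, h, w], align_corners=False)
        shifted = F.grid_sample(rgb.unsqueeze(0), grid, align_corners=False)
        lr = F.interpolate(shifted, size=(h // scale, w // scale),
                           mode="bilinear", align_corners=False)[0]
        # pack the RGGB Bayer pattern into a 4-channel RAW frame
        raw = torch.stack([lr[0, 0::2, 0::2], lr[1, 0::2, 1::2],
                           lr[1, 1::2, 0::2], lr[2, 1::2, 1::2]])
        frames.append((raw + noise_std * torch.randn_like(raw)).clamp(0, 1))
    return torch.stack(frames)                             # (burst_size, 4, H/(2*scale), W/(2*scale))

burst = synthesize_lr_burst(torch.rand(3, 384, 384))
print(burst.shape)   # torch.Size([14, 4, 48, 48])
```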
PyTorch (version 1.6.0) was used to implement the proposed SR model, and training and evaluation were performed on a PC with an RTX 2080 Ti GPU. The learning rate was set to $10^{-4}$, and the burst size was set to 14. Due to equipment limitations, the batch size and the number of epochs were set to 1 and 50, respectively. To objectively evaluate the model with the three proposed modules, the qualitative and quantitative quality of its SR images was compared with that of the baseline BSRT model (BSRT has two variants, BSRT-Small with 5,026,321 parameters and BSRT-Large with 20,824,561 parameters [3]; in this study, the lighter BSRT-Small was used as the baseline). For image quality evaluation, peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS) values were measured.
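For reference, the sketch below shows how these three metrics could be computed in PyTorch; the choice of the third-party lpips and pytorch_msssim packages is ours, as the paper does not state which metric implementations were used.

```python
import torch
import lpips                      # third-party package: pip install lpips
from pytorch_msssim import ssim   # third-party package: pip install pytorch-msssim

def evaluate_sr(sr, hr, lpips_model=None):
    """Minimal sketch of PSNR, SSIM, and LPIPS for RGB tensors of shape
    (B, 3, H, W) with values in [0, 1]."""
    mse = torch.mean((sr - hr) ** 2)
    psnr = 10.0 * torch.log10(1.0 / mse)                      # peak value is 1.0
    ssim_val = ssim(sr, hr, data_range=1.0)
    lpips_model = lpips_model or lpips.LPIPS(net="alex")
    lpips_val = lpips_model(sr * 2 - 1, hr * 2 - 1).mean()    # LPIPS expects inputs in [-1, 1]
    return psnr.item(), ssim_val.item(), lpips_val.item()
```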
Figure 3 shows the validation accuracy of BSRT and BSRT++ according to the number of epochs. As the number of epochs increased, the validation accuracy of both BSRT and BSRT++ continued to increase, and the gap between them remained almost unchanged. This indicates that the models were trained reliably with good generalization ability and suggests that BSRT++ would continue to outperform BSRT if the number of epochs were further increased.

4.1. Ablation Study

Table 1 shows the PSNR, SSIM, and LPIPS results of BSRT++ with different sets of proposed modules on the SyntheticBurst dataset. Each module contributed to the improvement in BSRT++ performance. The PSNR value increased by 0.94 dB when FEM and CSM were added separately, and by 1.22 dB when both were added. BSRT++ performed best with all modules, improving by 1.34 dB over BSRT with none of them. In terms of LPIPS, adding FEM alone showed the same performance as adding all modules. Meanwhile, the increase in inference time due to the added modules was only about 0.016 s, which is not significant. We believe that BSRT++ remains highly efficient because it runs fairly quickly even in our low-spec experimental environment. For experimental convenience, these results were obtained with epochs = 15, but Figure 3 suggests that the improvement will be consistent as the number of epochs increases. Figure 4 shows that FEM contributed to recovering richer texture in the SR images, while CSM and FRM contributed to reducing visual artifacts and distortions caused by misalignment. When all modules were applied, the SR images of BSRT++ had the smallest distortion without visual artifacts and were closest to the ground truth. In contrast, the SR images of BSRT failed to properly restore the high-frequency information related to local textures, showing severe distortion and visual artifacts (e.g., straight lines and textures were broken or crushed, and ghost textures were restored).

4.2. Comparison with Other Methods

BSRT++ was also compared with other MFSR methods: HighRes-net, DBSR, and Burstormer (we tried to include more recent methods in the comparison, but some did not provide source code and some required resources too large for us to train and test). The results are shown in Table 2.
All MFSR methods perform better than the SISR method. BSRT outperformed DBSR and HighRes-net on all metrics with half the number of epochs. Burstormer achieved higher PSNR and SSIM values than BSRT with shorter processing times, but performed worse in terms of LPIPS, indicating that its SR images may have perceptually inferior quality. In contrast, BSRT++ generated SR results with better perceptual and quantitative image quality than both BSRT and Burstormer. As mentioned in the ablation study, the processing time of BSRT++ was slightly longer than that of BSRT due to the added modules. Figure 5 shows SR images produced by the different MFSR methods: DBSR and BSRT suffered from visual artifacts or distortion and failed to recover fine texture and details (e.g., straight lines were blurred or curved, and ghost lines were restored). Burstormer produced visually more plausible SR images with reduced visual artifacts and distortion. However, BSRT++ best recovered the fine texture and details of the ground truth with few visual artifacts and little distortion.

5. Conclusions

In this study, we proposed BSRT++, which improves the feature extraction and fusion process of BSRT with three modules: FEM, CSM, and FRM. The modules recovered the complementary high-frequency information of each frame, enhanced inter-frame communication, and suppressed misaligned features, respectively. In the experimental results, all modules contributed to improving the perceptual and quantitative quality of the SR images. When all three modules were applied, BSRT++ performed best and outperformed other state-of-the-art MFSR methods, including BSRT.
In this study, we assumed that the feature maps from multiple frames are well aligned by the Pyramid FG-DCN alignment module. However, this assumption often does not hold in practice. The misalignment was partially mitigated by the FRM, but not completely. Therefore, in a future study, we plan to improve the alignment module using a network or approach that is more effective for flow computation or feature alignment.
In this study, the effectiveness of BSRT++ was demonstrated on the SyntheticBurst and RealBurst datasets. However, further evaluation on other datasets would provide more comprehensive evidence of its robustness and generalizability across different scenarios and image types, which we leave as another future study.

Author Contributions

Conceptualization, S.S. and H.P.; Funding acquisition, H.P.; Methodology, S.S. and H.P.; Software, S.S.; Supervision, H.P.; Validation, S.S. and H.P.; Writing—original draft, S.S.; Writing—review and editing, H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) Grant by the Korean Government through the MSIT under Grant 2021R1F1A1045749.

Data Availability Statement

The data that support the findings of this study are publicly available in the online repository: http://people.ee.ethz.ch/~ihnatova/pynet.html#dataset (accessed on 10 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SISR: single image super-resolution
MFSR: multi-frame super-resolution
BSRT: Burst Super-Resolution Transformer
FEM: feature enhancement module
CSM: cyclic sampling module
FRM: feature reweighting module
LR: low resolution
HR: high resolution
SR: super-resolution
CNN: convolutional neural network
FG-DCN: flow-guided deformable convolution network
PCD: pyramid, cascading, and deformable convolution
PSNR: peak signal-to-noise ratio
SSIM: structural similarity index measure
LPIPS: learned perceptual image patch similarity

References

  1. Yue, L.; Shen, H.; Li, J.; Yuan, Q.; Zhang, H.; Zhang, L. Image super-resolution: The techniques, applications, and future. Signal Process. 2016, 128, 389–408. [Google Scholar] [CrossRef]
  2. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Deep Burst Super-Resolution. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9205–9214. [Google Scholar]
  3. Luo, Z.; Li, Y.; Cheng, S.; Yu, L.; Wu, Q.; Wen, Z.; Fan, H.; Sun, J.; Liu, S. BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 997–1007. [Google Scholar]
  4. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  5. Bhat, G.; Gharbi, M.; Chen, J.; Gool, L.V.; Xia, Z. Self-Supervised Burst Super-Resolution. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 10571–10580. [Google Scholar] [CrossRef]
  6. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  7. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1646–1654. [Google Scholar]
  8. Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 1874–1883. [Google Scholar] [CrossRef]
  9. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar]
  10. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 294–310. [Google Scholar]
  11. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for Single Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 456–465. [Google Scholar] [CrossRef]
  12. Gao, G.; Wang, Z.; Li, J.; Li, W.; Yu, Y.; Zeng, T. Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 913–919. [Google Scholar] [CrossRef]
  13. Moser, B.B.; Shanbhag, A.S.; Raue, F.; Frolov, S.; Palacio, S.M.; Dengel, A. Diffusion Models, Image Super-Resolution And Everything: A Survey. arXiv 2024, arXiv:2401.00736. [Google Scholar]
  14. Tsai, R.; Huang, T.S. Multiframe image restoration and registration. Adv. Comput. Vis. Image Process. 1984, 1, 317–339. [Google Scholar]
  15. Peleg, S.; Keren, D.; Schweitzer, L. Improving image resolution using subpixel motion. Pattern Recognit. Lett. 1987, 5, 223–226. [Google Scholar] [CrossRef]
  16. Bascle, B.; Blake, A.; Zisserman, A. Motion deblurring and super-resolution from an image sequence. In Proceedings of the Computer Vision—ECCV’96, Cambridge, UK, 14–18 April 1996; pp. 571–582. [Google Scholar]
  17. Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3357–3366. [Google Scholar] [CrossRef]
  18. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  19. Wang, X.; Chan, K.K.; Yu, K.; Dong, C.; Loy, C. EDVR: Video Restoration with Enhanced Deformable Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1954–1963. [Google Scholar] [CrossRef]
  20. Dudhane, A.; Zamir, S.W.; Khan, S.; Khan, F.A.; Yang, M.H. Burst Image Restoration and Enhancement. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 5749–5758. [Google Scholar]
  21. Mehta, N.; Dudhane, A.; Murala, S.; Zamir, S.W.; Khan, S.; Khan, F.S. Adaptive Feature Consolidation Network for Burst Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 1278–1285. [Google Scholar] [CrossRef]
  22. Zamir, S.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.; Yang, M. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 5718–5729. [Google Scholar] [CrossRef]
  23. Dudhane, A.; Zamir, S.W.; Khan, S.S.; Khan, F.S.; Yang, M. Burstormer: Burst Image Restoration and Enhancement Transformer. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5703–5712. [Google Scholar]
  24. Sun, D.; Yang, X.; Liu, M.; Kautz, J. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar] [CrossRef]
  25. Luo, Z.; Yu, L.; Mo, X.; Li, Y.; Jia, L.; Fan, H.; Sun, J.; Liu, S. EBSR: Feature Enhanced Burst Super-Resolution with Deformable Alignment. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 471–478. [Google Scholar] [CrossRef]
  26. Ranjan, A.; Black, M.J. Optical Flow Estimation Using a Spatial Pyramid Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2720–2729. [Google Scholar]
  27. Wei, P.; Sun, Y.; Guo, X.; Liu, C.; Li, G.; Chen, J.; Ji, X.; Lin, L. Towards Real-World Burst Image Super-Resolution: Benchmark and Method. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13187–13196. [Google Scholar] [CrossRef]
  28. Bhat, G.; Danelljan, M.; Timofte, R. NTIRE 2022 Burst Super-Resolution Challenge. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 1040–1060. [Google Scholar] [CrossRef]
  29. Brooks, T.; Mildenhall, B.; Xue, T.; Chen, J.; Sharlet, D.; Barron, J.T. Unprocessing Images for Learned Raw Denoising. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE Computer Society: Los Alamitos, CA, USA, 2019; pp. 11028–11037. [Google Scholar] [CrossRef]
  30. Deudon, M.; Kalaitzis, A.; Goytom, I.; Arefin, M.R.; Lin, Z.; Sankaran, K.; Michalski, V.; Kahou, S.E.; Cornebise, J.; Bengio, Y. HighRes-net: Recursive Fusion for Multi-Frame Super-Resolution of Satellite Imagery. arXiv 2020, arXiv:2002.06460. [Google Scholar]
Figure 1. Process pipeline of BSRT.
Figure 2. Process pipeline of BSRT++. The architecture of BSRT++ is the same as that of BSRT, except for FEM, CSM, and FRM.
Figure 3. Validation accuracy of BSRT and BSRT++ according to the number of epochs.
Figure 4. Visual comparison of SR images produced by BSRT++ with different modules.
Figure 5. Visual comparison of SR images produced by different MFSR methods.
Table 1. Performance evaluation of BSRT++ with different modules when epochs = 15. The best results are in bold.

FEM | FRM | CSM | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time (s)
 -  |  -  |  -  | 39.23  | 0.948  | 0.055   | 0.338
 ✓  |  -  |  -  | 40.17  | 0.956  | 0.052   | 0.340
 -  |  -  |  ✓  | 40.17  | 0.955  | 0.054   | 0.350
 ✓  |  -  |  ✓  | 40.45  | 0.957  | 0.053   | 0.352
 ✓  |  ✓  |  ✓  | 40.57  | 0.959  | 0.052   | 0.354
Table 2. Comparison of BSRT++ with other burst SR methods when epochs = 50. The best results are in bold.

Method           | SyntheticBurst PSNR ↑ / SSIM ↑ / LPIPS ↓ | RealBurst PSNR ↑ / SSIM ↑ / LPIPS ↓ | Time (s)
Single Image [2] | 36.86 a / 0.919 a / 0.113 a              | 44.02 / 0.972 / 0.051               | 0.400
HighRes-net [30] | 37.45 a / 0.924 a / 0.106 a              | 43.99 / 0.972 / 0.051               | 0.463
DBSR b [2]       | 39.09 / 0.945 / 0.084                    | 45.17 / 0.978 / 0.037               | 0.431
BSRT [3]         | 40.27 / 0.951 / 0.047                    | 46.92 / 0.982 / 0.035               | 0.342
Burstormer [23]  | 41.13 / 0.967 / 0.108                    | 47.77 / 0.983 / 0.039               | 0.201
BSRT++           | 41.42 / 0.968 / 0.046                    | 48.23 / 0.984 / 0.024               | 0.374
a The results were taken from [3] and obtained when epochs = 100. b The results were obtained when epochs = 100.