Article

A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution

Yonggui Zhu and Guofang Li *
1 School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
2 School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(20), 8574; https://doi.org/10.3390/s23208574
Submission received: 22 September 2023 / Revised: 16 October 2023 / Accepted: 17 October 2023 / Published: 19 October 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

Effective aggregation of temporal information from consecutive frames is the core of video super-resolution. Many scholars have utilized structures such as sliding windows and recurrence to gather the spatio-temporal information of frames. However, although the performance of video super-resolution models keeps improving, their sizes are also increasing, which raises the demands placed on hardware. Thus, to reduce the burden on devices, we propose a novel lightweight recurrent grouping attention network. The model has only 0.878 M parameters, far fewer than current mainstream video super-resolution models. We design a forward feature extraction module and a backward feature extraction module to collect temporal information between consecutive frames from two directions. Moreover, a new grouping mechanism is proposed to efficiently collect spatio-temporal information of the reference frame and its neighboring frames. An attention supplementation module is presented to further enlarge the information gathering range of the model, and a feature reconstruction module aggregates information from the two directions to reconstruct high-resolution features. Experiments demonstrate that our model achieves state-of-the-art performance on multiple datasets.

1. Introduction

Super-resolution (SR) refers to generating high-resolution (HR) images from corresponding low-resolution (LR) images. As a branch of this field, video super-resolution (VSR) mainly utilizes the spatial information of the current frame and the temporal information between neighboring frames to reconstruct HR frames. VSR currently encompasses non-blind VSR [1], blind VSR [2], online VSR [3], and other branches [4], and is widely used in remote sensing [5,6], video surveillance [7,8], face recognition [9,10], and other fields [11,12]. With the development of technology, video resolutions are gradually increasing. Although this can enrich our lives and facilitate tasks such as surveillance and identification, it puts more pressure on video storage and transmission, and VSR technology plays an important role in addressing these issues. However, VSR is an ill-posed problem, and it is difficult to find the most appropriate reconstruction model. Thus, it remains worthwhile to continue exploring VSR technology.
To obtain high-quality images, previous studies have proposed numerous effective methods. Initially, researchers utilized interpolation methods to obtain HR videos [13,14]. These methods are fast, but the results are poor. With the development of deep learning, constructing models [15,16,17] in different domains with deep learning has become a mainstream research approach, and researchers have built various deep-learning-based VSR models that can reconstruct high-quality videos. For example, some works [18,19,20,21] have utilized explicit or implicit alignment to explore the temporal flow between frames. Such methods can effectively align adjacent frames to the reference frame and extract high-quality temporal information; however, the alignment increases the computational cost of the model, adding to the burden during training and testing, and inaccurate optical flow often leads to alignment errors that degrade performance. Other scholars [22,23,24] have used 3D convolution or deformable 3D convolution to directly aggregate spatio-temporal information between frames. Although this approach can quickly aggregate information from different times, it also incorporates considerable temporal redundancy into the features, which reduces the reconstruction ability of the model. In addition, with the rise of the Transformer in recent years, applying Transformers to construct VSR models has become a very popular research topic. Researchers [25,26,27] have applied Transformers to analyze the motion trajectories of videos and sufficiently aggregate the spatio-temporal information between consecutive frames. However, the relatively high computational cost of Transformers limits their further development in the VSR field.
Numerous studies on VSR demonstrate that, although the reconstruction ability of VSR models is becoming stronger, their frameworks are also becoming larger. Several recent VSR models [26,27,28,29,30,31] have parameter counts of 6 M or more, which undoubtedly increases the burden of training and testing and thus hampers their application in real scenarios. To ameliorate this problem, this paper focuses on a VSR model with a small parameter count; our goal is to obtain high-quality reconstructed frames with fewer parameters. Specifically, we design a model with fewer than 1 M parameters that achieves favorable results at a scale much smaller than mainstream models. This design reduces the model's dependence on hardware and makes it easier to apply to online VSR and mobile devices in the future.
In this paper, we present a novel lightweight recurrent grouping attention network (RGAN). It is a bi-directional propagation VSR model that can effectively aggregate information from different time ranges. In the RGAN, we construct the forward feature extraction module (FFEM) and the backward feature extraction module (BFEM), which efficiently aggregate long-range temporal information passed in the forward and backward directions. In addition, we propose a novel temporal grouping attention module (TGAM) that divides the input frames at each time step into a reference group and a fusion group. This grouping method can fully extract the information of the reference frame and adjacent frames while ensuring the stability of the model and preventing large temporal offsets. Then, we design the attention supplementation module (ASM). This module increases the scope of information collection and can more efficiently assist the model in recovering the detailed information of frames. After utilizing the FFEM and BFEM to effectively aggregate and adequately extract temporal information over different ranges, we design the feature reconstruction module (FRM) to fuse the features obtained from the FFEM and BFEM. This module can effectively integrate the temporal information of the two propagation stages and enhance the reconstruction capability of the model. Experiments demonstrate that our model performs well. The contributions of this paper are listed as follows:
  • We design a novel lightweight recurrent grouping attention network that achieves better model performance with a small number of parameters.
  • A new grouping method is designed to enhance the stability of the model and effectively extract spatio-temporal information from the reference frame and adjacent frames.
  • We design a new attention supplement module that enhances the range of information captured by the model and facilitates the recovery of more detailed information by the model.
  • Experiments indicate that our model achieves better results on the Vid4, SPMCS, UDM10, and RED4 datasets.
The rest of the paper is organized as follows: In Section 2, we describe the work related to the model. In Section 3, we introduce the specific structure of the model. In Section 4, we provide details regarding the training and testing of the model, and compare our results with those of other models and ablation studies. In Section 5, we summarize the paper and present our future research plan.

2. Related Work

2.1. Single-Image Super-Resolution

Single-image super-resolution (SISR) is the basis of super-resolution. In recent years, with the development of deep learning, SR has undergone a new revolution. Dong et al. [32] were the first to apply deep learning to SISR. They presented a three-layer convolutional neural network (SRCNN) and achieved better results: for 4× SISR evaluated by peak signal-to-noise ratio (PSNR), it outperforms the then state-of-the-art A+ algorithm [33] by 0.21 dB and 0.18 dB on the Set5 and Set14 datasets, respectively. This proved that deep learning possesses great potential in the field of SR. Subsequently, Kim et al. [34] presented a very deep neural network and applied residual learning to the SR model, achieving better results than SRCNN: for 4× SISR evaluated by PSNR, it outperforms SRCNN by 0.87 dB and 0.52 dB on Set5 and Set14, respectively. Song et al. [35] proposed using an additive neural network for SISR, replacing the multiplications of traditional convolution kernels with additions when computing output layers. Experiments demonstrate that this additive network achieves performance and visual quality comparable to convolutional neural networks while reducing energy consumption by approximately 2.5× when reconstructing a 1280 × 720 image. Liang et al. [36] introduced the Swin Transformer into SISR and obtained high-quality recovered images. Tian et al. [37] proposed heterogeneous grouping blocks to enhance the internal and external interactions of different channels and obtain rich low-frequency structural information. In practice, Lee et al. [38] applied the SR technique to satellite synthetic aperture radar and could effectively recover the information of scatterers. Moreover, many scholars have also constructed SISR models using methods such as GANs or VAEs [38,39,40,41,42]. Although SISR models can also be used to reconstruct HR videos, they only capture the spatial information of frames and cannot aggregate the temporal information between neighboring frames. As a result, videos recovered by SISR are of lower quality and often suffer from artifacts and other problems. To reconstruct high-quality HR videos, researchers have shifted their focus to VSR models.

2.2. Video Super-Resolution

VSR is an extension of SISR in which the temporal information between adjacent frames plays a vital role. To reconstruct high-quality HR frames, studies have built a variety of models. For instance, Caballero et al. [19] applied an optical flow field comprising coarse flow and fine flow to align adjacent frames, and constructed an end-to-end spatio-temporal module. Based on [19], Wang et al. [43] combined an optical flow field with long short-term memory to make more efficient use of inter-frame information and obtain more realistic details. Moreover, Tian et al. [44] presented the first model to introduce deformable convolution into VSR, which amplified the feature extraction ability of the model. Based on [44], Wang et al. [20] proposed a pyramid, cascading, and deformable (PCD) module that further enhances the alignment capability of the model. Then, Xu et al. [45] designed a temporal modulation block to modulate the PCD module and conducted short-term and long-term feature fusion to better extract motion cues. These optical-flow-based methods have also been applied in practical settings such as video surveillance: Guo et al. [8] utilized optical flow and other methods to construct a back-projection network that can effectively reconstruct high-quality surveillance videos. Moreover, Isobe et al. [22] proposed intra-group and inter-group fusion, using 3D convolution to capture and supplement the spatio-temporal information between different groups. Ying et al. [23] proposed deformable 3D convolution with efficient spatio-temporal exploration and adaptive motion compensation capabilities. Fuoli et al. [46] devised a hidden-space propagation scheme that effectively aggregates temporal information over long distances. Based on [46], Isobe et al. [28] explored the temporal differences between LR and HR space, effectively complementing the missing details in LR frames. Then, Jin et al. [5] used the temporal difference between long and short frames to achieve information compensation for satellite VSR. Liu et al. [26] designed a trajectory transformer that analyzes and utilizes motion trajectories between consecutive frames to obtain high-quality HR videos. On the basis of [26], Qiu et al. [27] introduced the frequency domain into VSR, providing a new basis for studying VSR.
Although all of the above methods are capable of reconstructing high-quality HR frames, their performance comes at the cost of larger model structures, which exacerbate the strain on equipment and consume significant resources. To avoid these problems, we propose a novel lightweight recurrent grouping attention network that obtains better recovery with fewer parameters. This lightweight design can effectively reduce the load on equipment and has both theoretical significance and practical application value.

3. Our Method

3.1. Overview

For the given consecutive frames $I_0^L, I_1^L, \ldots, I_T^L$, our goal is to generate the corresponding HR frames $I_0^H, I_1^H, \ldots, I_T^H$. Our proposed RGAN is a bi-directional propagation model in which each HR frame $I_t^H$ is generated from three consecutive frames $I_{t-1}^L, I_t^L, I_{t+1}^L$, the pre-hidden state $ht_{t-1}^{pre}$, the post-hidden state $ht_{t+1}^{post}$, the forward output feature $Out_{t-1}^{pre}$, and the backward output feature $Out_{t+1}^{post}$. The structure of the model is shown in Figure 1a. In the specific operation, we first input the consecutive frames $I_{t-1}^L, I_t^L, I_{t+1}^L$ into the FFEM and BFEM in order to gather more spatio-temporal information in different temporal directions. In the FFEM and BFEM, we utilize the TGAM and ASM to aggregate feature information. The role of the TGAM is to group the three input frames and gather the grouping information, while the role of the ASM is to collect spatio-temporal information from another perspective and increase access to information. Then, we present the FRM to fuse and reconstruct the outputs from the FFEM and BFEM, with the aim of obtaining the final output feature. Finally, the HR frame $I_t^H$ is obtained by summing the feature generated by the model and the bicubic upsampling result of the reference frame $I_t^L$.
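To make the propagation scheme concrete, the following is a minimal PyTorch-style sketch of the recurrence described above. The ffem, bfem, and frm arguments stand in for the modules of Sections 3.2, 3.3, 3.4 and 3.5; their interfaces, the hidden-state width of 48 channels, and the zero initialization of the states are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def rgan_forward(lr_frames, ffem, bfem, frm, hidden_ch=48):
    """Sketch of the bi-directional recurrence over a clip.

    lr_frames: (B, T, C, H, W). ffem/bfem/frm stand in for the paper's FFEM,
    BFEM, and FRM modules; their interfaces here are assumptions.
    """
    B, T, C, H, W = lr_frames.shape
    # Pad by copying the first and last frames, as described for the clip ends.
    padded = torch.cat([lr_frames[:, :1], lr_frames, lr_frames[:, -1:]], dim=1)

    # Backward propagation: gather post hidden states / output features.
    ht_post = lr_frames.new_zeros(B, hidden_ch, H, W)
    out_post = lr_frames.new_zeros(B, hidden_ch, H, W)
    post_feats = [None] * T
    for t in range(T - 1, -1, -1):
        triple = padded[:, t:t + 3]                      # I_{t-1}, I_t, I_{t+1}
        ht_post, out_post = bfem(triple, ht_post, out_post)
        post_feats[t] = out_post

    # Forward propagation and reconstruction.
    ht_pre = lr_frames.new_zeros(B, hidden_ch, H, W)
    out_pre = lr_frames.new_zeros(B, hidden_ch, H, W)
    hr_frames = []
    for t in range(T):
        triple = padded[:, t:t + 3]
        ht_pre, out_pre, f_ref_pre = ffem(triple, ht_pre, out_pre)
        residual = frm(out_pre, post_feats[t], f_ref_pre)        # HR residual feature
        base = F.interpolate(lr_frames[:, t], scale_factor=4,
                             mode='bicubic', align_corners=False)
        hr_frames.append(base + residual)
    return torch.stack(hr_frames, dim=1)                          # (B, T, 3, 4H, 4W)
```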

3.2. Forward/Backward Feature Extraction Module

The FFEM and BFEM are the core of the RGAN, and they have similar structures. In this section, we take the FFEM as an example to introduce the specific structures of the two modules. In the FFEM, we first input three consecutive frames $I_{t-1}^L, I_t^L, I_{t+1}^L$ into the TGAM. In the TGAM, we divide the consecutive frames into the reference group and the fusion group based on the category of the extracted information and extract the information separately to obtain the forward reference group feature $F_{ref}^{pre}$ and the forward fusion group feature $F_{fus}^{pre}$. Then, we input $F_{ref}^{pre}$, $F_{fus}^{pre}$, $ht_{t-1}^{pre}$, and $Out_{t-1}^{pre}$ into a cell to obtain the aggregated feature $F_{agg}^{pre}$. The cell consists of a 3 × 3 convolution and a Leaky ReLU. After that, we use the ASM to further optimize the feature information of $F_{agg}^{pre}$. Finally, the optimized information is delivered to two branches: one utilizes a cell to obtain the forward hidden state $ht_t^{pre}$ of the current time step, and the other utilizes a cell to obtain the output $Out_t^{pre}$ of the current time step. The $ht_t^{pre}$ and $Out_t^{pre}$ are applied to the next time step of the FFEM. Moreover, $Out_t^{pre}$ is applied to the FRM to synthesize the output of the current time step. The BFEM and FFEM have approximately the same structure. The difference between the two modules is that, after the TGAM, the BFEM utilizes a single cell to aggregate the information of the backward reference group feature $F_{ref}^{post}$, the backward fusion group feature $F_{fus}^{post}$, $ht_{t+1}^{post}$, and $Out_{t+1}^{post}$. In addition, $F_{ref}^{pre}$ in the FFEM is applied to the FRM, while $F_{ref}^{post}$ in the BFEM is not. To better represent the similar and different parts of the two modules, we provide their formulas as follows:
$$ht_t^{pre},\ Out_t^{pre} = N_{FFEM}\left(I_{t-1}^{L},\ I_t^{L},\ I_{t+1}^{L},\ ht_{t-1}^{pre},\ Out_{t-1}^{pre}\right),$$
$$ht_t^{post},\ Out_t^{post} = N_{BFEM}\left(I_{t-1}^{L},\ I_t^{L},\ I_{t+1}^{L},\ ht_{t+1}^{post},\ Out_{t+1}^{post}\right),$$
where $N_{FFEM}$ and $N_{BFEM}$ denote the FFEM and BFEM, respectively, and $ht_t^{post}$ and $Out_t^{post}$ indicate the hidden state and output feature obtained by the BFEM at the current time step. The FFEM and BFEM can effectively aggregate temporal information in different directions to enhance and optimize the feature extraction capability of the model.
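As a sketch, the "cell" and the FFEM data flow might be implemented as below. The channel width ch, the Leaky ReLU slope, and the injected tgam/asm sub-modules (sketched in the following subsections) are assumptions; the BFEM would mirror this structure with the post-direction states.

```python
import torch
import torch.nn as nn

def cell(cin, cout):
    # The paper's "cell": one 3x3 convolution followed by a Leaky ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.LeakyReLU(0.1, inplace=True))

class FFEM(nn.Module):
    """Data flow of the forward feature extraction module:
    TGAM -> fuse with previous hidden state and output -> ASM -> two cells
    producing ht_t^pre and Out_t^pre."""
    def __init__(self, tgam, asm, ch=48):
        super().__init__()
        self.tgam = tgam                 # temporal grouping attention module (Section 3.3)
        self.asm = asm                   # attention supplementation module (Section 3.4)
        self.fuse = cell(4 * ch, ch)     # F_ref^pre, F_fus^pre, ht_{t-1}^pre, Out_{t-1}^pre
        self.to_hidden = cell(ch, ch)    # -> ht_t^pre
        self.to_output = cell(ch, ch)    # -> Out_t^pre

    def forward(self, triple, ht_prev, out_prev):
        # triple: (B, 3, C, H, W) holding I_{t-1}, I_t, I_{t+1}
        f_ref, f_fus = self.tgam(triple)
        f_agg = self.asm(self.fuse(torch.cat([f_ref, f_fus, ht_prev, out_prev], dim=1)))
        # F_ref^pre is also returned because the FRM reuses it (Section 3.5).
        return self.to_hidden(f_agg), self.to_output(f_agg), f_ref
```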

3.3. Temporal Grouping Attention Module

The role of the TGAM is to efficiently extract features from the reference frame and its neighboring frames, and it has the same structure in the FFEM and BFEM. We take the TGAM in the FFEM as a case study to introduce its specific structure, which is shown in Figure 1b. For the three consecutive input frames $I_{t-1}^L, I_t^L, I_{t+1}^L$, we first divide them into the reference group $\{I_t^L\}$ and the fusion group $\{I_{t-1}^L, I_t^L, I_{t+1}^L\}$. The purpose of the reference group is to maintain the temporal stability of the model and prevent large shifts in the generated features. Meanwhile, the aim of the fusion group is to efficiently aggregate temporal information between adjacent frames. For the reference group, we apply four cells to obtain the reference group feature $F_{ref}^{pre}$. The formula for this operation is as follows:
$$F_{ref}^{pre} = \big(C_4 \circ C_3 \circ C_2 \circ C_1\big)\!\left(I_t^{L};\ \theta_{ref}^{1}, \ldots, \theta_{ref}^{4}\right),$$
where $C_i$ and $\theta_{ref}^{i}$ represent the cells and the corresponding parameters; the cells below are expressed in the same manner. The role of the reference group is to exclude temporal interference and extract feature information only from the spatial domain, preventing large shifts in the FFEM during iterations. For the fusion group, we employ four cells to obtain the feature $F_{fus}^{pre}$. Then, we use the temporal attention module (TAM) to further collect inter-frame information. The TAM is inspired by [22]. Firstly, we apply a cell to obtain the feature $F_{att}$. Subsequently, we select one of the channel feature maps in $F_{att}$ and compute the attention mapping using the softmax function in the depth dimension. Then, we utilize element-wise multiplication to multiply the remaining channels in $F_{att}$ by the temporal weights to obtain the attention feature $F_{att}^{pre}$. The role of the TAM is to add a new path for extracting features of continuous frames, thus better extracting detailed spatio-temporal information of continuous frames. Finally, we splice $F_{fus}^{pre}$ and $F_{att}^{pre}$ in the channel dimension and perform extraction with a single cell to obtain the final fusion group feature $F_{fus}^{pre}$. The structure of the TAM is illustrated in Figure 1c. The formula for this operation is as follows:
$$F_{fus}^{pre} = N_{TAM}\Big(\big(C_4 \circ C_3 \circ C_2 \circ C_1\big)\!\left(I_{t-1}^{L}, I_t^{L}, I_{t+1}^{L};\ \theta_{fus}^{1}, \ldots, \theta_{fus}^{4}\right)\Big),$$
where $N_{TAM}$ and $\theta_{fus}^{i}$ denote the TAM and the cell parameters, respectively. The role of the fusion group is to initially extract the spatio-temporal features of consecutive frames and prepare for the subsequent feature supplementation and reconstruction.
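The description of the TAM leaves some freedom in how the selected channel map and the depth-wise softmax are realized; the sketch below is one plausible reading, in which the fused feature is split into one channel slice per input frame, the first map of each slice provides the attention logits, and the remaining maps are re-weighted. The per-frame slicing, channel widths, and default frame count are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

def cell(cin, cout):  # 3x3 conv + Leaky ReLU, as in the FFEM sketch
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))

class TAM(nn.Module):
    """One map per temporal slice acts as a logit; a softmax over the depth
    (slice) dimension yields weights that re-scale the remaining channels."""
    def __init__(self, ch, n_frames=3):
        super().__init__()
        assert ch % n_frames == 0
        self.n = n_frames
        self.att_cell = cell(ch, ch)                                  # produces F_att

    def forward(self, f_fus):
        b, c, h, w = f_fus.shape
        f_att = self.att_cell(f_fus).view(b, self.n, c // self.n, h, w)
        weights = torch.softmax(f_att[:, :, 0], dim=1).unsqueeze(2)   # (b, n, 1, h, w)
        return (f_att[:, :, 1:] * weights).reshape(b, -1, h, w)       # F_att^pre

class TGAM(nn.Module):
    """Sketch of the temporal grouping attention module."""
    def __init__(self, frame_ch=3, ch=48, n_frames=3):
        super().__init__()
        self.ref_branch = nn.Sequential(*[cell(frame_ch if i == 0 else ch, ch) for i in range(4)])
        self.fus_branch = nn.Sequential(*[cell(n_frames * frame_ch if i == 0 else ch, ch) for i in range(4)])
        self.tam = TAM(ch, n_frames)
        self.merge = cell(2 * ch - n_frames, ch)     # splice F_fus^pre with F_att^pre

    def forward(self, triple):                       # triple: (b, 3, C, H, W)
        b, n, c, h, w = triple.shape
        f_ref = self.ref_branch(triple[:, 1])        # reference group: spatial features of I_t only
        f_fus = self.fus_branch(triple.reshape(b, n * c, h, w))
        f_att = self.tam(f_fus)
        f_fus = self.merge(torch.cat([f_fus, f_att], dim=1))
        return f_ref, f_fus
```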

3.4. Attention Supplementation Module

After fusing the information from different stages in the FFEM and BFEM, we apply the ASM to further increase the information acquisition range of the model and supplement the missing detail information. The core of the ASM is to use a densely connected structure and a spatial attention mechanism to obtain more spatio-temporal information, and it has the same structure in the FFEM and BFEM. We describe the specific structure of the ASM using the ASM in the FFEM as an example; it is displayed in Figure 1d. Firstly, we construct a modulation block that consists of a 1 × 1 convolution, a 3 × 3 convolution, a Leaky ReLU, and a spatial attention module [47]. Then, we apply a modulation block to $F_{agg}^{pre}$ to obtain the feature $F_{ASM1}$. Next, we splice $F_{agg}^{pre}$ and $F_{ASM1}$ in the channel dimension and process them with a modulation block to obtain the output feature $F_{ASM2}$. After that, we splice $F_{agg}^{pre}$, $F_{ASM1}$, and $F_{ASM2}$ in the channel dimension and aggregate the three groups of features with a single cell. Finally, we employ three residual blocks [48] to further optimize the obtained features. The ASM is able to further extract the information that has been fused and increase the range of information accessed by the model. In addition, the densely connected structure can effectively aggregate different types of information and enhance the utilization of features.
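A sketch of the ASM data flow is given below. The spatial attention block here is a simplified per-pixel gate standing in for the attention design of [47], the residual block follows the conv-ReLU-conv form of [48], and the channel widths are assumptions.

```python
import torch
import torch.nn as nn

def cell(cin, cout):  # 3x3 conv + Leaky ReLU (see the FFEM sketch)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))

class SpatialAttention(nn.Module):
    # Simplified stand-in for the attention block of [47]: a per-pixel gate.
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class ResBlock(nn.Module):
    # EDSR-style residual block [48]: conv-ReLU-conv with an identity skip.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

def modulation_block(cin, ch):
    # 1x1 conv -> 3x3 conv -> Leaky ReLU -> spatial attention, as described above.
    return nn.Sequential(nn.Conv2d(cin, ch, 1), nn.Conv2d(ch, ch, 3, padding=1),
                         nn.LeakyReLU(0.1, inplace=True), SpatialAttention(ch))

class ASM(nn.Module):
    """Sketch of the attention supplementation module with its dense connections."""
    def __init__(self, ch=48):
        super().__init__()
        self.mod1 = modulation_block(ch, ch)
        self.mod2 = modulation_block(2 * ch, ch)
        self.fuse = cell(3 * ch, ch)
        self.res = nn.Sequential(*[ResBlock(ch) for _ in range(3)])

    def forward(self, f_agg):
        f1 = self.mod1(f_agg)                                  # F_ASM1
        f2 = self.mod2(torch.cat([f_agg, f1], dim=1))          # F_ASM2
        out = self.fuse(torch.cat([f_agg, f1, f2], dim=1))
        return self.res(out)
```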

3.5. Feature Reconstruction Module

After the FFEM and BFEM, we use the FRM to aggregate the two groups of temporal information obtained from different directions. The structure of the FRM is displayed in Figure 1e. Firstly, we utilize a cell to fuse $Out_t^{pre}$, $Out_t^{post}$, and $F_{ref}^{pre}$. The purpose of applying $F_{ref}^{pre}$ at this aggregation stage is to further enhance the stability of the model and prevent bias in the generated features. Afterwards, we apply three residual blocks to further optimize the fused features and employ a 3 × 3 convolution to adjust the number of channels of the output feature to 48 to facilitate upsampling. Finally, we use sub-pixel magnification [49] to obtain the final HR features. The FRM can effectively aggregate the information generated by the FFEM and BFEM in different directions, possesses strong reconstruction capability, and can obtain high-quality HR features.
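The FRM can be sketched as follows, with sub-pixel magnification realized by nn.PixelShuffle: projecting to 48 channels and shuffling with scale 4 yields a 3-channel HR residual, since 48 = 3 × 4². The internal width and the reuse of the EDSR-style residual block are assumptions.

```python
import torch
import torch.nn as nn

def cell(cin, cout):  # 3x3 conv + Leaky ReLU (see the FFEM sketch)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))

class ResBlock(nn.Module):  # EDSR-style residual block [48], as in the ASM sketch
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class FRM(nn.Module):
    """Sketch of the feature reconstruction module."""
    def __init__(self, ch=48, scale=4):
        super().__init__()
        self.fuse = cell(3 * ch, ch)                     # Out_t^pre, Out_t^post, F_ref^pre
        self.res = nn.Sequential(*[ResBlock(ch) for _ in range(3)])
        self.to_rgb = nn.Conv2d(ch, 3 * scale * scale, 3, padding=1)   # 48 channels for x4
        self.upsample = nn.PixelShuffle(scale)

    def forward(self, out_pre, out_post, f_ref_pre):
        x = self.fuse(torch.cat([out_pre, out_post, f_ref_pre], dim=1))
        x = self.res(x)
        return self.upsample(self.to_rgb(x))             # (B, 3, 4H, 4W) HR residual
```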

4. Experiments

4.1. Training Datasets and Details

Datasets In this paper, we utilize Vimeo-90K [50] as the training dataset. This dataset contains over 90 K video sequences, each consisting of seven consecutive frames with a resolution of 448 × 256 . Moreover, we apply Vid4 [51], SPMCS [52], UDM10 [53], and RED4 [54] as test datasets. These datasets contain sequences of different lengths and resolutions of natural environments, human landscapes, and other types of sequences, which can effectively indicate the performance and generalization ability of the model. Then, we utilize PSNR and structural similarity (SSIM) as evaluation metrics and perform tests on the Y channel in YCbCr space.
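Since evaluation is performed on the Y channel of YCbCr space, the sketch below shows one common way to compute Y-channel PSNR for images scaled to [0, 1], using BT.601 luma coefficients. Border cropping and the SSIM computation are omitted, and the exact evaluation protocol used by the authors may differ.

```python
import torch

def rgb_to_y(img):
    # ITU-R BT.601 luma, with img in [0, 1] and shape (..., 3, H, W).
    r, g, b = img[..., 0, :, :], img[..., 1, :, :], img[..., 2, :, :]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr_y(sr, hr, eps=1e-12):
    # PSNR between the Y channels of a super-resolved frame and its ground truth.
    mse = torch.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * torch.log10(1.0 / (mse + eps))
```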
Implementation details For training, we randomly crop 256 × 256 HR patches from the Vimeo-90K sequences, using the same region for every frame of a sequence. The corresponding 64 × 64 LR patches are obtained by applying a Gaussian blur kernel with standard deviation σ = 1.6 followed by 4× downsampling. The initial learning rate is set to $1 \times 10^{-4}$ and is halved every 25 epochs until 75 epochs. The batch size is 8. Moreover, to ensure that all video sequences can be adequately trained and tested, we copy the first and last frames to complement their missing neighbors. During training and the ablation studies, we input seven consecutive frames. Meanwhile, to increase the diversity of the training data, we apply random rotations and flips to the input sequences. During testing, the number of frames input at a time depends on the length of the sequence. All training and evaluation are conducted with Python 3.8, PyTorch 1.8, and RTX 3090 GPUs.
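The degradation used to build LR training pairs (Gaussian blur with σ = 1.6 followed by 4× downsampling) can be sketched as below. The kernel size, reflect padding, and stride-based subsampling are our assumptions about how the blur-and-downsample step is realized, and the commented optimizer/scheduler lines mirror the stated learning-rate schedule (the choice of Adam is an assumption).

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma=1.6, ksize=13):
    # 2D Gaussian kernel; ksize is an assumption (the paper only gives sigma).
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def degrade(hr, sigma=1.6, scale=4):
    """Blur + downsample a batch of HR frames (B, 3, H, W) into LR frames."""
    k = gaussian_kernel(sigma).to(hr)[None, None].repeat(3, 1, 1, 1)  # one kernel per RGB channel
    pad = k.shape[-1] // 2
    blurred = F.conv2d(F.pad(hr, [pad] * 4, mode='reflect'), k, groups=3)
    return blurred[:, :, ::scale, ::scale]                            # stride-4 subsampling

# Optimizer/schedule sketch: 1e-4 initial LR, halved every 25 epochs until 75.
# model = RGAN(); opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# sched = torch.optim.lr_scheduler.StepLR(opt, step_size=25, gamma=0.5)
```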

4.2. Comparison with State-of-the-Art Methods

In this section, we compare our model with several state-of-the-art models. The models for comparison include TOFLOW [50], FRVSR [55], D3D [23], and OVSR [56]. TOFLOW explored the relationships between neighboring frames using task-oriented motion cues. FRVSR adopted the HR result of the previous frame to generate the current frame, constructing the uni-directional VSR model. D3D designed deformable 3D convolutions to directly aggregate spatio-temporal information in continuous frames. OVSR devised a bi-directional omniscient network that effectively aggregates past, present, and future temporal information.
It is well known that different training datasets and downsampling methods affect model performance. Thus, to ensure a fair comparison, we retrain these models with the same training set and Gaussian downsampling. Moreover, to compare models of the same parameter magnitude, we adjust the number of channels and the depth of OVSR. The quantitative comparison results are recorded in Table 1. Compared with TOFLOW, FRVSR, and D3D, our model is far superior in terms of both performance and runtime, indicating that the small-scale model we designed outperforms these classical models. Moreover, compared with OVSR, our model performs considerably better at similar parameters and runtime. These findings demonstrate that our model achieves state-of-the-art performance with small-scale parameters.
After the quantitative comparisons, we also make qualitative comparisons of these models. The results are displayed in Figure 2 and Figure 3. Figure 2 shows that our model recovers fine structures such as numbers more faithfully, and Figure 3 shows that our model possesses a better ability to restore details and edges. These results further indicate that our proposed model achieves state-of-the-art performance.

4.3. Ablation Studies

Ablation studies of the temporal grouping attention module. In processing the three consecutive frames $I_{t-1}^L, I_t^L, I_{t+1}^L$, we propose two innovations: one is to process the reference frame as a separate group, and the other is to design the TAM for extracting features from the three consecutive frames. To demonstrate the validity of these two constructions, we design ablation experiments with three reduced configurations: removing both the reference group and the TAM, removing only the reference group, and removing only the TAM. The quantitative comparison results are summarized in Table 2. Table 2 shows that when the reference group is removed, the performance of the model changes little whether or not the TAM is added, whereas with the reference group in place, adding the TAM effectively enhances performance. This suggests that grouping the reference frame individually is meaningful and that supplementing the TAM is worthwhile. Finally, a comprehensive comparison indicates that with both the reference group and the TAM, the model shows a significant improvement in performance with only a small increase in parameters, confirming that both components have positive implications for the model. Moreover, the qualitative comparison results of the four groups of models are displayed in Figure 4. The comparison indicates that with the addition of the reference group and the TAM, the model has a better recovery ability.
Ablation studies of the attention supplementation module. In the FFEM and BFEM, after fusing different types of spatio-temporal features, we design the ASM to further enlarge the information extraction range of the model and supplement missing temporal information. To demonstrate the role of the ASM, we construct the ablation model RGAN-N by replacing the ASM with residual blocks. Moreover, in the modulation block of the ASM, we add the spatial attention module to further enhance the model's performance; to demonstrate its importance, we also design a model without the spatial attention module, named RGAN-S. The quantitative comparison results of these ablation experiments are displayed in Table 3. Comparing RGAN and RGAN-N shows that the ASM has a positive effect on the performance of the model. Moreover, comparing RGAN-S and RGAN shows that adding the spatial attention module effectively improves performance. These results prove that the ASM we designed is efficient and meaningful.

5. Conclusions

In this paper, we propose a novel lightweight recurrent grouping attention network that centers on obtaining better VSR results with small-scale parameters. We design a forward feature extraction module and a backward feature extraction module to obtain sufficient temporal information from two directions. The temporal grouping attention module is proposed to efficiently aggregate temporal information between the reference frame and adjacent frames. Moreover, the attention supplementation module is used to further optimize the fused information and expand the information collection range of the model. Finally, we apply the feature reconstruction module to efficiently aggregate and restructure the information from different directions to obtain high-quality HR features. Experiments demonstrate that our model achieves excellent performance. The scale of our model is much smaller than that of current mainstream video super-resolution models, which makes it better suited to applications such as remote sensing and video surveillance. In future research, we aim to build VSR models for satellite video and virtual reality based on this model. Moreover, we will further optimize the attention module, generation module, and loss function to improve its performance.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, G.L.; validation, G.L.; formal analysis, Y.Z.; investigation, G.L.; resources, Y.Z.; data curation, G.L.; writing—original draft preparation, G.L.; writing—review and editing, Y.Z.; visualization, G.L.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Supported by the National Natural Science Foundation of China (No. 11571325) and the Fundamental Research Funds for the Central Universities (No. CUC2019 A002).

Data Availability Statement

Our code is available at https://github.com/karlygzhu/RGAN (accessed on 22 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, W.; Zhou, M.; Ji, C.; Sui, X.; Bai, J. Cross-Frame Transformer-Based Spatio-Temporal Video Super-Resolution. IEEE Trans. Broadcast. 2022, 68, 359–369. [Google Scholar] [CrossRef]
  2. Pan, J.; Bai, H.; Dong, J.; Zhang, J.; Tang, J. Deep Blind Video Super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 4791–4800. [Google Scholar]
  3. Xiao, J.; Jiang, X.; Zheng, N.; Yang, H.; Yang, Y.; Yang, Y.; Li, D.; Lam, K. Online Video Super-Resolution with Convolutional Kernel Bypass Graft. IEEE Trans. Multimed. 2022, 1–16. [Google Scholar] [CrossRef]
  4. Wang, Y.; Isobe, T.; Jia, X.; Tao, X.; Lu, H.; Tai, Y. Compression-Aware Video Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2023; pp. 2012–2021. [Google Scholar]
  5. Jin, X.; He, J.; Xiao, Y.; Yuan, Q. Learning a Local-Global Alignment Network for Satellite Video Super-Resolution. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  6. Xiao, Y.; Yuan, Q.; Jiang, K.; Jin, X.; He, J.; Zhang, L.; Lin, C. Local-Global Temporal Difference Learning for Satellite Video Super-Resolution. arXiv 2023, arXiv:2304.04421. [Google Scholar] [CrossRef]
  7. Guarnieri, G.; Fontani, M.; Guzzi, F.; Carrato, S.; Jerian, M. Perspective registration and multi-frame super-resolution of license plates in surveillance videos. Digit. Investig. 2021, 36, 301087. [Google Scholar] [CrossRef]
  8. Guo, K.; Guo, H.; Ren, S.; Zhang, J.; Li, X. Towards efficient motion-blurred public security video super-resolution based on back-projection networks. J. Netw. Comput. Appl. 2020, 166, 102691. [Google Scholar] [CrossRef]
  9. Yu, F.; Li, H.; Bian, S.; Tang, Y. An Efficient Network Design for Face Video Super-resolution. In Proceedings of the Conference on Computer Vision Workshops, virtual event, 10–17 October 2021; pp. 1513–1520. [Google Scholar]
  10. López-López, E.; Pardo, X.M.; Regueiro, C.V. Incremental Learning from Low-labelled Stream Data in Open-Set Video Face Recognition. Pattern Recognit. 2022, 131, 108885. [Google Scholar] [CrossRef]
  11. Lee, Y.; Yun, J.; Hong, Y.; Lee, J.; Jeon, M. Accurate license plate recognition and super-resolution using a generative adversarial networks on traffic surveillance video. In Proceedings of the IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Jeju, Republic of Korea, 24–26 June 2018; pp. 1–4. [Google Scholar]
  12. Seibel, H.; Goldenstein, S.; Rocha, A. Eyes on the Target: Super-Resolution and License-Plate Recognition in Low-Quality Surveillance Videos. IEEE Access 2017, 5, 20020–20035. [Google Scholar] [CrossRef]
  13. Zhang, L.; Wu, X. An edge-guided image interpolation algorithm via directional filtering and data fusion. IEEE Trans. Image Process. 2006, 15, 2226–2238. [Google Scholar] [CrossRef]
  14. Liu, X.; Zhao, D.; Zhou, J.; Gao, W.; Sun, H. Image Interpolation via Graph-Based Bayesian Label Propagation. IEEE Trans. Image Process. 2014, 23, 1084–1096. [Google Scholar]
  15. Tian, C.; Yuan, Y.; Zhang, S.; Lin, C.; Zuo, W.; Zhang, D. Image super-resolution with an enhanced group convolutional neural network. Neural Netw. 2022, 153, 373–385. [Google Scholar] [CrossRef] [PubMed]
  16. Tian, C.; Zheng, M.; Zuo, W.; Zhang, B.; Zhang, Y.; Zhang, D. Multi-stage image denoising with the wavelet transform. Pattern Recognit. 2023, 134, 109050. [Google Scholar] [CrossRef]
  17. Zhu, Z.; He, X.; Li, C.; Liu, S.; Jiang, K.; Li, K.; Wang, J. Adaptive Resolution Enhancement for Visual Attention Regions Based on Spatial Interpolation. Sensors 2023, 23, 6354. [Google Scholar] [CrossRef] [PubMed]
  18. Wen, W.; Ren, W.; Shi, Y.; Nie, Y.; Zhang, J.; Cao, X. Video Super-Resolution via a Spatio-Temporal Alignment Network. IEEE Trans. Image Process. 2022, 31, 1761–1773. [Google Scholar] [CrossRef] [PubMed]
  19. Caballero, J.; Ledig, C.; Aitken, A.P.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In Proceedings of the Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2848–2857. [Google Scholar]
  20. Wang, X.; Chan, K.C.K.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In Proceedings of the Conference on Computer Vision Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 1954–1963. [Google Scholar]
  21. Wang, W.; Liu, Z.; Lu, H.; Lan, R.; Zhang, Z. Real-Time Video Super-Resolution with Spatio-Temporal Modeling and Redundancy-Aware Inference. Sensors 2023, 23, 7880. [Google Scholar] [CrossRef]
  22. Isobe, T.; Li, S.; Jia, X.; Yuan, S.; Slabaugh, G.G.; Xu, C.; Li, Y.; Wang, S.; Tian, Q. Video Super-Resolution with Temporal Group Attention. In Proceedings of the Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8005–8014. [Google Scholar]
  23. Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3D Convolution for Video Super-Resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504. [Google Scholar] [CrossRef]
  24. Liu, H.; Zhao, P.; Ruan, Z.; Shang, F.; Liu, Y. Large Motion Video Super-Resolution with Dual Subnet and Multi-Stage Communicated Upsampling. In Proceedings of the AAAI Conference on Artificial Intelligence, virtual event, 2–9 February 2021; pp. 2127–2135. [Google Scholar]
  25. Geng, Z.; Liang, L.; Ding, T.; Zharkov, I. RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17420–17430. [Google Scholar]
  26. Liu, C.; Yang, H.; Fu, J.; Qian, X. Learning Trajectory-Aware Transformer for Video Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5677–5686. [Google Scholar]
  27. Qiu, Z.; Yang, H.; Fu, J.; Liu, D.; Xu, C.; Fu, D. Learning Spatiotemporal Frequency-Transformer for Low-Quality Video Super-Resolution. arXiv 2022, arXiv:2212.14046. [Google Scholar]
  28. Isobe, T.; Jia, X.; Tao, X.; Li, C.; Li, R.; Shi, Y.; Mu, J.; Lu, H.; Tai, Y.W. Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling. In Proceedings of the Conference on Computer Vision Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17411–17420. [Google Scholar]
  29. Chan, K.C.K.; Zhou, S.; Xu, X.; Loy, C.C. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In Proceedings of the Conference on Computer Vision Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5962–5971. [Google Scholar]
  30. Chan, K.C.K.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond. In Proceedings of the Conference on Computer Vision Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4947–4956. [Google Scholar]
  31. Lin, J.; Huang, Y.; Wang, L. FDAN: Flow-guided Deformable Alignment Network for Video Super-Resolution. arXiv 2021, arXiv:2105.05640. [Google Scholar]
  32. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Volume 8692, pp. 184–199. [Google Scholar]
  33. Timofte, R.; De Smet, V.; Gool, L.V. Anchored Neighborhood Regression for Fast Example-Based Super-Resolution. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, 1–8 December 2013; pp. 1920–1927. [Google Scholar] [CrossRef]
  34. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the Conference on Computer Vision Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1646–1654. [Google Scholar]
  35. Song, D.; Wang, Y.; Chen, H.; Xu, C.; Xu, C.; Tao, D. AdderSR: Towards Energy Efficient Image Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, virtual event, 10–17 October 2021; pp. 15648–15657. [Google Scholar]
  36. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the Conference on Computer Vision Workshops, virtual event, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
  37. Tian, C.; Zhang, Y.; Zuo, W.; Lin, C.; Zhang, D.; Yuan, Y. A heterogeneous group CNN for image super-resolution. arXiv 2022, arXiv:2209.12406. [Google Scholar] [CrossRef]
  38. Lee, S.J.; Lee, S.G. Efficient Super-Resolution Method for Targets Observed by Satellite SAR. Sensors 2023, 23, 5893. [Google Scholar] [CrossRef]
  39. Shi, Y.; Han, L.; Han, L.; Chang, S.; Hu, T.; Dancey, D. A Latent Encoder Coupled Generative Adversarial Network (LE-GAN) for Efficient Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  40. Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning Texture Transformer Network for Image Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5790–5799. [Google Scholar]
  41. Malczewski, K. Diffusion Weighted Imaging Super-Resolution Algorithm for Highly Sparse Raw Data Sequences. Sensors 2023, 23, 5698. [Google Scholar] [CrossRef] [PubMed]
  42. Zhang, D.; Tang, N.; Zhang, D.; Qu, Y. Cascaded Degradation-Aware Blind Super-Resolution. Sensors 2023, 23, 5338. [Google Scholar] [CrossRef]
  43. Wang, Z.; Yi, P.; Jiang, K.; Jiang, J.; Han, Z.; Lu, T.; Ma, J. Multi-Memory Convolutional Neural Network for Video Super-Resolution. IEEE Trans. Image Process. 2019, 28, 2530–2544. [Google Scholar] [CrossRef] [PubMed]
  44. Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3357–3366. [Google Scholar]
  45. Xu, G.; Xu, J.; Li, Z.; Wang, L.; Sun, X.; Cheng, M. Temporal Modulation Network for Controllable Space-Time Video Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition, virtual event, 10–17 October 2021; pp. 6388–6397. [Google Scholar]
  46. Fuoli, D.; Gu, S.; Timofte, R. Efficient Video Super-Resolution through Recurrent Latent Space Propagation. In Proceedings of the Conference on Computer Vision Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 3476–3485. [Google Scholar]
  47. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 294–310. [Google Scholar]
  48. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the Conference on Computer Vision Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar]
  49. Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the Conference on Computer Vision Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1874–1883. [Google Scholar]
  50. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vision. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  51. Liu, C.; Sun, D. On Bayesian Adaptive Video Super Resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 346–360. [Google Scholar] [CrossRef]
  52. Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-Revealing Deep Video Super-Resolution. In Proceedings of the Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 4482–4490. [Google Scholar]
  53. Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive Fusion Video Super-Resolution Network via Exploiting Non-Local Spatio-Temporal Correlations. In Proceedings of the Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 3106–3115. [Google Scholar]
  54. Nah, S.; Baik, S.; Hong, S.; Moon, G.; Son, S.; Timofte, R.; Lee, K.M. NTIRE 2019 Challenge on Video Deblurring and Super-Resolution: Dataset and Study. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 1996–2005. [Google Scholar]
  55. Sajjadi, M.S.M.; Vemulapalli, R.; Brown, M. Frame-Recurrent Video Super-Resolution. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Munich, Germany, 8–14 September 2018; pp. 6626–6634. [Google Scholar]
  56. Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Lu, T.; Tian, X.; Ma, J. Omniscient Video Super-Resolution. In Proceedings of the Conference on Computer Vision, virtual event, 10–17 October 2021; pp. 4409–4418. [Google Scholar]
Figure 1. (a) The overall pipeline of the recurrent grouping attention network. (b) The structure of the temporal grouping attention module. (c) The structure of the temporal attention module. (d) The structure of the attention supplementation module. (e) The structure of the feature reconstruction module.
Figure 2. Qualitative comparison of Vid4, SPMCS, and UDM10 datasets for 4× VSR.
Figure 3. Qualitative comparison of SPMCS, UDM10, and RED4 datasets for 4× VSR.
Figure 4. Qualitative comparison results of the temporal grouping attention module for 4× VSR. The numbers 1, 2, 3, and 4 correspond to the different methods in Table 2.
Table 1. Quantitative comparison (PSNR(dB) and SSIM) of Vid4, SPMCS11, UDM10, and RED4 datasets for 4× VSR. The bold portion indicates the best performance. The runtime is calculated based on the LR image of 320 × 180 .
Method       | Bicubic      | TOFLOW [50]  | FRVSR [55]   | D3D [23]     | OVSR [56]    | Ours
Params (M)   | -/-          | 1.4          | 5.1          | 2.6          | 0.895        | 0.878
Runtime (ms) | -/-          | 493          | 114          | 119          | 19           | 18
Vid4         | 21.80/0.5246 | 25.85/0.7659 | 26.69/0.8103 | 26.72/0.8134 | 26.26/0.7984 | 26.80/0.8149
SPMCS        | 23.29/0.6385 | 27.86/0.8237 | 28.16/0.8421 | 28.71/0.8515 | 27.79/0.8433 | 28.95/0.8608
UDM10        | 28.47/0.8253 | 36.26/0.9438 | 37.09/0.9522 | 37.36/0.9545 | 36.80/0.9511 | 37.93/0.9575
RED4         | 26.14/0.7292 | 27.93/0.7997 | 29.71/0.8356 | 29.50/0.8319 | 29.45/0.8285 | 29.82/0.8383
Table 2. Quantitative comparison of the ablation study of the temporal grouping attention module. ‘RG’ represents the reference group. ‘√’ indicates the addition of this module and the bold portion indicates the best performance.
Method | RG | TAM | Param (M) | Vid4         | SPMCS        | UDM10        | RED4
1      |    |     | 0.771     | 26.35/0.8027 | 28.21/0.8454 | 37.12/0.9536 | 29.62/0.8338
2      |    | √   | 0.815     | 26.39/0.7997 | 27.99/0.8354 | 37.32/0.9534 | 29.64/0.8339
3      | √  |     | 0.834     | 26.53/0.8091 | 28.16/0.8477 | 37.40/0.9549 | 29.67/0.8341
4      | √  | √   | 0.878     | 26.80/0.8149 | 28.95/0.8608 | 37.93/0.9575 | 29.82/0.8383
Table 3. Quantitative comparison of the ablation study of the attention supplementation module. The bold portion indicates the best performance.
Method    | RGAN-N       | RGAN-S       | RGAN
Param (M) | 0.937        | 0.877        | 0.878
Vid4      | 26.72/0.8124 | 26.40/0.800  | 26.80/0.8149
SPMCS     | 28.84/0.8581 | 28.30/0.8419 | 28.95/0.8608
UDM10     | 37.77/0.9563 | 37.06/0.9520 | 37.93/0.9575
RED4      | 29.72/0.8358 | 29.57/0.8312 | 29.82/0.8383