Article

A Lightweight Reconstruction Model via a Neural Network for a Video Super-Resolution Model

Xinkun Tang, Ying Xu, Feng Ouyang and Ligu Zhu
1 Academy of Broadcasting Science, No. 2, Beijing 100866, China
2 School of Data Science and Media Intelligence, Communication University of China, Beijing 100024, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10165; https://doi.org/10.3390/app131810165
Submission received: 26 July 2023 / Revised: 29 August 2023 / Accepted: 8 September 2023 / Published: 9 September 2023
(This article belongs to the Special Issue Applications of Video, Digital Image Processing and Deep Learning)

Abstract: Super-resolution in image and video processing has been a long-standing challenge in computer vision, and its progress has substantial practical ramifications. More specifically, video super-resolution methods aim to restore spatial details while upholding the temporal coherence among frames. Nevertheless, the extensive parameter counts and high computational demands of existing deep convolutional neural networks challenge their deployment on mobile platforms. In response to these concerns, our research undertakes an in-depth investigation of deep convolutional neural networks and offers a lightweight model for video super-resolution that reduces the computational load. In this study, we bring forward a lightweight model for video super-resolution, the Deep Residual Recursive Network (DRRN). The model applies residual learning to stabilize Recurrent Neural Network (RNN) training while adopting depthwise separable convolution to boost the efficiency of the super-resolution operations. Thorough experimental evaluations reveal that the proposed model excels in computational efficiency and generates refined and temporally consistent results for video super-resolution. Hence, this research presents a crucial stride toward applying video super-resolution strategies on devices with resource limitations.

1. Introduction

As the information society ceaselessly advances, video has become indispensable as a medium for information acquisition. Its applications permeate numerous sectors, such as the consumer industry, healthcare, and defense, and the demand for superior video quality continues to rise in these contexts. Nevertheless, improved video quality demands higher transmission bandwidth, more sophisticated acquisition apparatus, extensive storage infrastructure, and fast processing speeds. Given financial and feasibility considerations, such prerequisites often present formidable challenges in real-world situations, leading to video degradation or incomplete observations and thereby significantly impacting the user experience. Thus, reconstructing pristine, high-fidelity videos from inferior video signals or incomplete observations remains a focal point of research within image processing. Prevalent video reconstruction challenges include recovering high-resolution video from sparsely sampled or low-resolution signals (through compressed sensing or super-resolution), restoring visually pleasing video from blurred signals (deblurring), correcting degraded video (for example, low-light enhancement or defogging), and producing high-frame-rate video (video frame interpolation).
Super Resolution (SR) operates as a crucial function in computer vision. Its main task is to transform a low-resolution (LR) image into a high-resolution (HR) variant, simultaneously enriching the image’s fine details to optimize the aesthetic appeal. SR primarily deals with two entities—images and videos. Image-based SR technology focuses on amplifying the subtle details whilst improving the resolution of LR images. On the other hand, the technology for video SR must guarantee a seamless flow between successive frames while also meeting the requirements of image SR. If not, the outcome might be a disjointed video sequence, which could negatively affect the user’s viewing experience.
In recent years, research into SR technology has been accelerating worldwide, partially due to the advent of deep learning technology [1], which, with improved hardware computational performance and the emergence of high-resolution displays, necessitates matching resolution. Amid the swift progression of digital media and the Internet’s pervasive influence, video content has established itself as an indispensable part of daily life and information dissemination. High-definition, superior-quality video presentations are integral to delivering an enhanced viewing experience and catering to user needs. However, the processing and transmission of high-resolution video present significant challenges to computational and bandwidth resources. Traditional video super-resolution technology often fails to satisfy real-time performance and resource efficiency criteria, particularly in mobile devices and real-time applications. The lightweight video super-resolution technology design caters to real-time applications and mobile devices’ needs, providing superior video quality while concurrently diminishing the computational burden and bandwidth usage.
Existing SR algorithms can be divided into two categories: traditional methods and deep-learning-based methods. Traditional SR algorithms include bilinear interpolation, bicubic interpolation, sparse representation [2], Bayesian methods [3], Bayesian methods with motion-blur handling [4], and so on. The deep-learning-based approach is currently the most active research direction, and SR algorithms implemented with this technology far surpass traditional SR techniques. Many image and video SR algorithms based on deep learning have been proposed, and they attempt to improve the SR effect from different angles, including the original data itself, the network structure, and the loss function. The first work to apply deep learning to SR was the Super Resolution Convolutional Neural Network (SRCNN) [5], which consists of three convolutional layers and restores LR images through feature extraction, non-linear mapping, and reconstruction. Building on this work, many excellent image SR algorithms were subsequently proposed from the perspectives of the network input form, network structure, loss function, and degree of information utilization, including Fast Super-Resolution by CNN (FSRCNN) [6], Very Deep Super Resolution (VDSR) [7], the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [8], Enhanced Deep Residual Networks (EDSR) [9], the Residual Dense Network (RDN) [10], Residual Channel Attention Networks (RCAN) [11], Dual Regression Networks (DRN) [12], etc. The development of image super-resolution algorithms usually accompanies the development of video SR algorithms. After deep-learning-based image SR was proposed, corresponding video SR algorithms followed, including Deep Draft-Ensemble Learning (Deep-DE) [13] and Video Super-Resolution with Convolutional Neural Networks (VSRnet) [14]. VSRnet builds on the SRCNN approach and likewise uses three convolution layers, but its input is changed from a single image to multiple consecutive frames. Since then, more and more strong algorithms have been proposed, including the Video Efficient Sub-Pixel Convolutional Neural Network (VESPCN) [15], Dynamic Upsampling Filters (DUF) [16], Detail-Revealing Deep Video Super-Resolution (DRVSR) [17], the Temporally Deformable Alignment Network (TDAN) [18], the Recurrent Back-Projection Network (RBPN) [19], Video Restoration with Enhanced Deformable Convolutional Networks (EDVR) [20], etc. At present, video SR research focuses on how to make more effective use of information between video frames so as to capture as much information helpful for super-resolving the target frame as possible, such as high-frequency details. A variety of strategies have therefore been put forward, including 3D convolution, non-local networks, deformable convolution, and optical flow-based Motion Estimation and Motion Compensation (MEMC). Among the once-popular flow-based approaches, the estimated optical flow motion information is often insufficiently accurate, which degrades the subsequent SR process and ultimately limits performance. Inter-frame utilization based on deformable convolution has attracted much attention recently due to its excellent performance, as in EDVR, the Video Enhancement and Super-Resolution Network (VESRnet) [21], and TDAN. This is because alignment based on deformable convolution amounts to the integration of multiple optical flows.
For a 3 × 3 convolution kernel, the result is equivalent to nine optical flows complementing and integrating with one another, while optical flow-based alignment corresponds to the special case of deformable convolution with a 1 × 1 kernel; thus, more accurate inter-frame information can be obtained with deformable convolution, and the performance is correspondingly better. This theoretical explanation is elaborated on in [22]. In addition, some scholars transfer knowledge from other fields, such as semantic segmentation and generative adversarial networks, to SR problems; the proposed algorithms include the Super-Resolution Generative Adversarial Network (SRGAN) [23], Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) [24], the Video Super-Resolution Residual Neural Network (VSRResNet) [25], Temporal Coherence Generative Adversarial Networks (TecoGAN) [26], etc. Guo et al. [27] proposed a channel and spatial attention module (CSAM) dedicated to single-image SR that attends to image details and different semantic levels in the feature map. Tian et al. [28] proposed the CNN-based asymmetric image super-resolution network (ACNet), which highlights local key features through an asymmetric convolution structure composed of 3 × 1, 1 × 3, and 3 × 3 convolutions, reducing information redundancy and accelerating training. Although many algorithms perform well on numerical metrics, the perceived visual quality is not always as good as those metrics suggest; since video is ultimately watched by an audience, a considerable body of work, such as TecoGAN, explores more reasonable evaluation metrics.
In this study, we evaluate the effectiveness of various temporal modeling methods for VSR tasks, employing a consistent loss function (L1 loss) and training data. This paper introduces a lightweight video super-resolution network, the Deep Residual Recursive Network (DRRN), which incorporates 3D residual connections into the recurrent network's hidden state and adopts depthwise separable convolution. In the proposed hidden state, the identity branch serves a dual purpose: it not only transmits abundant image details from one layer to the next, but also helps mitigate gradient vanishing during RNN training. To further amplify the potency of DRRN, this study crafts a dual-channel model featuring ResBlocks, dubbed DBRRN. Relative to DRRN, the performance of this model is notably enhanced.

2. Materials and Methods

In this section, we describe the layout of the complete system pipeline and the temporal modeling methodology. The system is structured around two primary elements: a temporal modeling network that takes consecutive frames as inputs and amalgamates them with the reference frame, and the network's loss function, which implicitly harnesses motion information.
We perform a comprehensive examination and comparison of three temporal modeling techniques. These methods are visually represented in Figure 1a–c, respectively. Figure 1c provides an intricate portrayal of the Dual-channel model with ResBlock Recursive Network (DBRRN) architecture.
The following subsections elaborate on the structure and functionality of the depicted networks, detailing the specifics of each temporal modeling method and explaining the design choices made in constructing the DBRRN. Every facet of the architecture is described, from the input to the hidden layers, including the types of layers incorporated, their interconnections, and the specific functions they execute.

2.1. Recurrent Neural Network (RNN)

A Neural Network (NN) is also commonly known as an Artificial Neural Network (ANN). A Recurrent Neural Network (RNN) is a class of neural network that takes sequence data as input and recurses along the direction of sequence evolution, with all recurrent units connected in a chain.
An RNN is a special network structure based on the idea that human cognition builds on past experience and memory. Unlike Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), an RNN not only considers the current input but also gives the network a memory of previously processed content.
RNNs are called recurrent networks because the current output of a sequence also depends on previous outputs. Concretely, the network remembers earlier information and applies it to the computation of the current output: the hidden-layer nodes are no longer independent but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
However, RNNs in VSR also suffer from vanishing gradients, as in many other video processing tasks. To address this problem, we adopt a recursive network, the RRN, which employs residual mapping between layers with identity skip connections. This design ensures a smooth flow of information and enables long-term preservation of texture information, making it easier for RNNs to handle longer sequences while reducing the risk of gradient vanishing during training.

2.2. Recursive Residual Network (RRN)

The RRN [29], a significant model within deep learning, offers a powerful mechanism for reconstructing super-resolution images. By leveraging the unique architecture of recursive and residual connections, the RRN model proficiently enhances the detail and lucidity of low-resolution images through several iterative processes and progressive layer enhancement.
The RRN model’s quintessential philosophy revolves around super-resolution reconstruction, achieved by recursively learning and iteratively updating residual information. The structure of the model incorporates numerous stacked recursive units that consist of a CNN and a residual learning module. Within each recursive unit, the CNN focuses on feature extraction and upscaling low-resolution images, whereas the residual learning module involves acquiring and transmitting residual data. The model incrementally augments the image resolution and quality through the repeated accumulation of these units.
Conventionally, the training of the RRN model adopts an end-to-end approach, creating a direct correlation from low-resolution images to their high-resolution equivalents. The model’s parameters are optimized and revised using a pixel-level loss function (like mean square error), which gauges the deviation between the super-resolution image produced and the original high-resolution image. A perceptual loss function, such as those predicated on perceptual distance (including perceptual loss and content loss), can be utilized to boost the visual fidelity of the reconstructed images.
The RRN model has accomplished substantial breakthroughs within the image super-resolution domain, substantiating the model’s aptitude to enhance image detail restoration and visual quality through recursive learning and residual connections. While maintaining computational efficiency, it delivers high-quality super-resolution reconstruction. Additionally, the RRN model has the potential to be collaboratively fused with other technologies and models, like attention mechanisms and Generative Adversarial Networks (GANs), thereby further escalating the super-resolution outcome.
In conclusion, the RRN model is an efficacious deep learning model for image super-resolution reconstruction tasks. Through its recursive and residual connection structure, the model progressively enhances the detail and clarity of images, offering essential technical support for image processing and application endeavors.

2.3. Depth-Separable Convolution

The DSC has been popularized by notable models such as the Extreme version of Inception (Xception) [30] and MobileNet [31], two landmark Google models in which DSC features prominently. As shown in Figure 2, the DSC comprises two elements: Depthwise Convolution and Pointwise Convolution.
The operation of Depthwise Convolution is relatively simple. It employs a distinct convolution kernel for each channel of the input feature map. The outcomes of all convolution kernels are subsequently concatenated to yield the final output. This procedure is illustrated in the Depthwise Convolution section of Figure 2.
DSC is a specialized convolution operation within CNNs. It offers a more lightweight approach to convolution, boasting fewer parameters and reduced computational requirements than traditional convolution operations.
DSC initiates the process by performing feature extraction on the input images, executing convolution operations independently for each input channel. Subsequently, Pointwise Convolution is employed to amalgamate the output results from Depthwise Convolution. This process weights and sums the results from each channel to produce the final output results.
The primary advantage of DSC over traditional convolution operations is its significant reduction in the number of parameters and computational demands. This efficiency allows DSC to maintain model accuracy while significantly improving the model’s speed. DSC is particularly suited for scenarios with limited computational resources, including mobile and embedded devices. This makes it a popular choice in the design of CNNs.
An essential characteristic of Depthwise Convolution is that the number of output channels equals the number of convolution kernels. Since each channel in Depthwise Convolution uses a single convolution kernel, each kernel produces a single-channel output after the convolution operation. Hence, if the input feature map has N channels (as illustrated in Figure 2), performing separate convolutions with one kernel per channel produces N single-channel feature maps, and concatenating these N feature maps yields an output feature map with N channels.
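To make the parameter savings concrete, the following is a minimal PyTorch sketch of a depthwise separable convolution. This is not the authors' implementation; the channel sizes are chosen only for illustration.

```python
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes the channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_channels assigns one kernel to each input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        # the 1x1 convolution weights and sums the per-channel outputs
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

if __name__ == "__main__":
    standard = nn.Conv2d(128, 128, kernel_size=3, padding=1)
    separable = DepthwiseSeparableConv2d(128, 128)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard), count(separable))  # roughly 147.6k vs 17.8k parameters
```

For a 128-to-128 channel layer with a 3 × 3 kernel, the separable form needs roughly one eighth of the parameters of a standard convolution, which is the source of the efficiency gains discussed above.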

2.4. Network Design

Three variants of deep neural networks were examined: (1) RNN, (2) DRRN, and (3) DBRRN. These networks accept a video sequence as input and utilize a stack of three-dimensional convolutional layers to extract spatiotemporal information from the sequence. The recurrent formulation takes only a few frames as hidden-state inputs at each step and processes long video sequences cyclically. As outlined above, the complete pipeline consists of a temporal modeling network that takes consecutive frames as input and merges them with the reference frame, together with a loss function that implicitly exploits motion information.
RRN: The Recursive Residual Network (RRN) addresses the issue of gradient vanishing by adopting residual mapping between layers with identity skip connections. This configuration guarantees a seamless information flow and enables prolonged retention of texture information. Consequently, it simplifies the handling of long sequences while mitigating the risk of gradient disappearance during training. At time step t, the RRN uses the following equations to generate the hidden state h_t and the output o_t that are passed to the next time step t + 1:
$$\begin{aligned}
\hat{x}_0 &= \sigma\left(W_{\mathrm{conv2D}}\left\{\left[I_{t-1},\, I_t,\, o_{t-1},\, h_{t-1}\right]\right\}\right)\\
\hat{x}_k &= g\!\left(\hat{x}_{k-1}\right) + F\!\left(\hat{x}_{k-1}\right), \quad k \in [1, K]\\
h_t &= \sigma\left(W_{\mathrm{conv2D}}\left\{\hat{x}_K\right\}\right)\\
o_t &= W_{\mathrm{conv2D}}\left\{\hat{x}_K\right\}
\end{aligned}$$
where $g(\hat{x}_{k-1})$ denotes the identity mapping in the identity residual block and $F(\hat{x}_{k-1})$ denotes the residual mapping in layer $k$.
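The recurrence above can be sketched in PyTorch as follows. This is a simplified illustration rather than the authors' code: the channel width (128), the number of residual blocks (5), and the ×4 pixel-shuffle output head are assumptions based on the experimental settings reported later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Identity branch g(x) plus residual branch F(x) from the RRN recurrence."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class RRNCell(nn.Module):
    """One recurrent step: fuse [I_{t-1}, I_t, o_{t-1}, h_{t-1}], refine with K
    residual blocks, then emit the hidden state h_t and the output feature o_t."""
    def __init__(self, in_ch=3, channels=128, num_blocks=5, scale=4):
        super().__init__()
        self.scale = scale
        fused = 2 * in_ch + in_ch * scale ** 2 + channels   # frames + o_{t-1} + h_{t-1}
        self.head = nn.Sequential(nn.Conv2d(fused, channels, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        self.to_hidden = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                       nn.ReLU(inplace=True))
        self.to_output = nn.Conv2d(channels, in_ch * scale ** 2, 3, padding=1)

    def forward(self, prev_frame, cur_frame, prev_out, prev_hidden):
        x = self.head(torch.cat([prev_frame, cur_frame, prev_out, prev_hidden], dim=1))
        x = self.blocks(x)
        h_t = self.to_hidden(x)
        o_t = self.to_output(x)
        # the SR frame is commonly recovered by pixel-shuffling o_t and adding a
        # bicubically upsampled copy of the current frame (assumed here)
        sr = F.pixel_shuffle(o_t, self.scale) + \
             F.interpolate(cur_frame, scale_factor=self.scale, mode="bicubic",
                           align_corners=False)
        return o_t, h_t, sr
```

At the first time step, the previous output and hidden state can simply be initialized to zero, matching the setting described in Section 3.2.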
DRRN: In DRRN, the convolution used in RRN is substituted with depthwise separable convolution. This design modification reduces the number of model parameters and the computation time.
$$\begin{aligned}
\hat{x}_0 &= \sigma\left(W_{\mathrm{DSconv}}\left\{\left[I_{t-1},\, I_t,\, o_{t-1},\, h_{t-1}\right]\right\}\right)\\
\hat{x}_k &= g\!\left(\hat{x}_{k-1}\right) + F\!\left(\hat{x}_{k-1}\right), \quad k \in [1, K]\\
h_t &= \sigma\left(W_{\mathrm{DSconv}}\left\{\hat{x}_K\right\}\right)\\
o_t &= W_{\mathrm{DSconv}}\left\{\hat{x}_K\right\}
\end{aligned}$$
where $W_{\mathrm{DSconv}}$ denotes a depthwise separable convolution.
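Under the same assumptions as the sketches above, the DRRN variant follows by replacing each standard convolution in the cell and in its residual blocks with the DepthwiseSeparableConv2d module sketched in Section 2.3, for example:

```python
import torch.nn as nn
# DepthwiseSeparableConv2d as sketched in Section 2.3

class DSResBlock(nn.Module):
    """Residual block built from depthwise separable convolutions (DRRN variant)."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(
            DepthwiseSeparableConv2d(channels, channels),
            nn.ReLU(inplace=True),
            DepthwiseSeparableConv2d(channels, channels),
        )

    def forward(self, x):
        # identity branch plus residual branch, as in the standard ResBlock
        return x + self.body(x)
```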
DBRRN: Initially, the input image is reorganized by a Pixel Unshuffle layer, which moves data from the spatial dimension to the depth dimension. The input data is then concatenated and passed into a convolution (Conv) layer followed by a Rectified Linear Unit (ReLU) [8], and subsequently into a series of stacked Residual Blocks (ResBlocks) [12]. Each ResBlock consists of Conv-ReLU-Conv layers, with a skip connection combining its output and input. In one branch, the features from the dual-channel ResBlocks are fed to a Conv-ReLU layer to form the output feature. In the other branch, the dual-channel features are processed by a Conv layer and reshaped by a Shuffle layer.
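The text above outlines the DBRRN data path; a rough PyTorch sketch follows. The exact dual-branch wiring, the unshuffle/shuffle factors, and the channel sizes are not fully specified in the text, so everything below is an assumption for illustration only (ResBlock is reused from the RRN sketch above).

```python
import torch
import torch.nn as nn
# ResBlock as defined in the RRN sketch above

class DBRRNCell(nn.Module):
    """Illustrative dual-branch recurrent step: PixelUnshuffle -> Conv-ReLU ->
    stacked ResBlocks -> one branch for the hidden state, one branch mapped
    back to image space by PixelShuffle."""
    def __init__(self, in_ch=3, channels=128, num_blocks=5, unshuffle=2, scale=4):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(unshuffle)      # space-to-depth on the inputs
        frame_ch = in_ch * unshuffle ** 2
        out_ch = in_ch * (unshuffle * scale) ** 2
        self.head = nn.Sequential(
            nn.Conv2d(2 * frame_ch + out_ch + channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        # branch 1: feature handed to the next time step as the hidden state
        self.hidden_branch = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                           nn.ReLU(inplace=True))
        # branch 2: feature mapped to image space and expanded by PixelShuffle
        self.output_branch = nn.Conv2d(channels, out_ch, 3, padding=1)
        self.shuffle = nn.PixelShuffle(unshuffle * scale)  # net x4 upscaling overall

    def forward(self, prev_frame, cur_frame, prev_out, prev_hidden):
        x = torch.cat([self.unshuffle(prev_frame), self.unshuffle(cur_frame),
                       prev_out, prev_hidden], dim=1)
        x = self.blocks(self.head(x))
        h_t = self.hidden_branch(x)
        o_t = self.output_branch(x)
        return o_t, h_t, self.shuffle(o_t)
```

With the assumed unshuffle factor of 2 and scale of 4, the final PixelShuffle factor of 8 restores the feature to four times the resolution of the low-resolution input.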

2.5. Image Quality Evaluation

Evaluating the quality of super-resolution images is crucial to any super-resolution task. The techniques mentioned are commonly used for this purpose, and each has its unique advantages:
Peak Signal-to-Noise Ratio (PSNR): This quantitative metric is widely recognized and utilized. It is simple to calculate and offers a score for straightforward comparisons across diverse methods or datasets. PSNR is based on the mean squared error (MSE) between the pixel values of the original and reconstructed images. A higher PSNR score signifies a closer resemblance to the original image, implying superior image quality. Nevertheless, it is noteworthy that PSNR presumes the most important factor of image quality to be fidelity to the precise pixel values of the original image, an assumption that does not always correspond with human perception.
Structural Similarity Index (SSIM): SSIM is another commonly used measure that quantifies the similarity between two images. Unlike PSNR, SSIM considers changes in structural information, contrast, and brightness, which are critical factors in human visual perception. SSIM scores range from −1 to 1, where a value of 1 means the two images are identical. This makes SSIM a more reliable measure of perceived image quality.
It is also worth noting that measures like Multi-Scale Structural Similarity (MS-SSIM), Visual Information Fidelity (VIF), and others can also be used depending on the specific requirements of the task. The selection of the appropriate measurement often hinges on the characteristics of the images under consideration and the particular facets of image quality deemed crucial in a specific scenario.
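As a concrete reference, below is a small Python sketch of how PSNR can be computed from the MSE definition above, with SSIM delegated to scikit-image. The 8-bit data range and the use of the luminance (Y) channel are assumptions matching the evaluation settings reported later.

```python
import numpy as np
from skimage.metrics import structural_similarity  # reference SSIM implementation

def psnr(original, reconstructed, data_range=255.0):
    """PSNR derived from the MSE between the original and reconstructed images."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim(original, reconstructed, data_range=255.0):
    """SSIM on a single channel (e.g. the Y/luminance channel)."""
    return structural_similarity(original, reconstructed, data_range=data_range)

# hypothetical usage: hr_y and sr_y are HxW uint8 luminance images
# print(psnr(hr_y, sr_y), ssim(hr_y, sr_y))
```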

3. Results

3.1. Datasets

3.1.1. Vimeo-90k

We leveraged the Vimeo-90k [32] as our training repository in the present investigation. This vast dataset contains high-quality video data, making it ideal for an array of low-level video processing tasks, including video denoising, deblocking, video frame interpolation, and video super-resolution. It includes roughly 90,000 assorted video clips derived from diverse scenarios and activities; its versatility thus renders it highly suitable for training our proposed model. For our purposes, we utilized the Septuplet subset from this dataset, which is composed of 91,701 groups of video clips, each containing seven frames with a resolution of 256 × 448. To generate our training dataset, we applied a Gaussian blur (σ = 1.6) to the high-resolution frames and downscaled them by a factor of 4, from which low-resolution patches of size 64 × 64 were cropped.
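A minimal sketch of this degradation pipeline (Gaussian blur with σ = 1.6 followed by ×4 downsampling and random 64 × 64 LR patch cropping) is shown below. The use of OpenCV and of direct subsampling after blurring are assumptions, since the exact downsampling implementation is not stated.

```python
import cv2
import numpy as np

def make_lr(hr, sigma=1.6, scale=4):
    """Blur the HR frame with a Gaussian kernel and downsample it by `scale`."""
    blurred = cv2.GaussianBlur(hr, (0, 0), sigma)  # kernel size derived from sigma
    return blurred[::scale, ::scale]               # direct subsampling after blurring

def random_patch_pair(hr, lr, lr_patch=64, scale=4, rng=None):
    """Crop a random 64x64 LR patch together with its aligned 256x256 HR patch."""
    rng = rng or np.random.default_rng()
    h, w = lr.shape[:2]
    y = int(rng.integers(0, h - lr_patch + 1))
    x = int(rng.integers(0, w - lr_patch + 1))
    lr_crop = lr[y:y + lr_patch, x:x + lr_patch]
    hr_crop = hr[y * scale:(y + lr_patch) * scale, x * scale:(x + lr_patch) * scale]
    return lr_crop, hr_crop
```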

3.1.2. Vid 4

We chose the Vid 4 [3] dataset for evaluating our model. Vid 4 is a widely recognized video super-resolution test set consisting of four scenes with unique motion and occlusion characteristics. The four video sequences in this dataset are ‘calendar’ (41 frames, 576 × 720 resolution), ‘city’ (34 frames, 576 × 704 resolution), ‘foliage’ (49 frames, 480 × 720 resolution), and ‘walk’ (47 frames, 480 × 720 resolution).

3.1.3. SPMCS

We also utilized the SPMCS [17] dataset for evaluation purposes. This dataset encompasses 30 video sequences, each comprising 31 successive frames and sporting a resolution of 540 × 960. It includes input images subsampled at x2, x3, and x4 factors, in addition to high-resolution raw images. The variety of subsampling rates and the higher-resolution frames make this an ideal dataset for evaluating our super-resolution model.

3.1.4. UDM10

Lastly, we used the UDM10 [33] dataset for further evaluation. This dataset is a commonly used test set for video super-resolution tasks, and contains 10 video sequences, each consisting of 32 consecutive frames at a resolution of 720 × 1272.
The datasets, encompassing a wide range of scenes, resolutions, and frame rates, present a thorough platform for assessing our super-resolution model’s performance under different circumstances. The ensuing section will present the findings derived from these evaluations.

3.2. Experimental Settings and Training Procedures

Three distinct models, named RNN, DRRN, and DBRRN, were utilized in our experiments. The explicit parameters for each model are detailed in Table 1.
In order to facilitate an unbiased and uniform comparison across models, we harmonized several settings in all conducted experiments. Notably, the channel size for every model was fixed at 128, and we deployed five blocks as latent states in the RNN-based models.
These blocks each follow the same architecture: an initial convolution layer, a subsequent ReLU activation layer, and a concluding convolution layer. The channel size of the convolution layer was designated as 128.
We set the prior estimate (previous output and hidden state) to zero at the first time step (t0). The initial learning rate for training the recurrent models was set to 1 × 10−4 and decayed by a factor of 0.1 every 60 epochs, with training lasting 70 epochs.
Model optimization was accomplished with the Adam optimizer [17], with β1 = 0.9, β2 = 0.999, and a weight decay of 5 × 10−4. The model's training was supervised using the pixel-wise L1 loss function, which is an efficient tool for reducing the disparities between the predicted and actual pixel values and boosts the model's capacity to generate high-resolution images that closely reflect the original ones.
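The following sketch mirrors these settings in PyTorch (Adam with β1 = 0.9, β2 = 0.999, weight decay 5 × 10−4, initial learning rate 1 × 10−4 decayed by 0.1 every 60 epochs, and pixel-wise L1 loss). The model and data loader are placeholders, not part of the paper.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=70, device="cuda"):
    """Training loop matching the reported optimization settings."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    criterion = nn.L1Loss()                        # pixel-wise L1 loss
    for _ in range(epochs):
        for lr_frames, hr_frames in train_loader:  # assumed (LR clip, HR clip) batches
            lr_frames, hr_frames = lr_frames.to(device), hr_frames.to(device)
            optimizer.zero_grad()
            sr_frames = model(lr_frames)           # recurrent model run over the clip
            loss = criterion(sr_frames, hr_frames)
            loss.backward()
            optimizer.step()
        scheduler.step()                           # decay lr by 0.1 every 60 epochs
    return model
```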
These configurations and training methods, which have proven to be efficacious in prior research, lay a robust groundwork for comparing the proposed models’ performance. A detailed examination of the experimental outcomes is covered in the upcoming sections.

4. Discussion

This section introduces a comparative study of three temporal modeling techniques, RNN, DRRN, and DBRRN, applied to three different datasets: Vid4, SPMCS, and UDM10. We also present the trade-off between runtime and accuracy in Table 1.
The gathered quantitative and qualitative results are summarized in Table 1. In this table, we also highlight the balance between the execution time and accuracy of the models. Notably, the DRRN and DBRRN models demonstrate superior computational efficiency relative to the RNN baseline, attaining pleasing outcomes with fewer parameters. The runtime of the DRRN is 15 ms shorter than that of the RNN.
Figure 3 provides a visual comparison of the impacts of RNN, DRRN, and DBRRN. The comparison vividly illustrates that both DRRN and DBRRN effectively capture the prominent image features, yielding superior outcomes in contrast to their low-resolution equivalents. The outcomes detailed in Table 1 demonstrate that although DRRN and DBRRN do not exhibit a substantial performance improvement when compared directly to RNN, they do significantly reduce the computational requirements and execution duration.
Figure 4 illustrates the PSNR and Loss values of the three models on the Vid 4 dataset. A slight improvement of DBRRN over DRRN is noticeable, although the difference from RNN is marginal. However, as discernible from Figure 3, this minor difference does not significantly affect the perceptual quality of the output.
Table 2 provides an overview of the experimental outcomes when using the DBRRN model on the SPMCS and UDM10 datasets. It is noticeable that DBRRN's performance varies across different categories. Specifically, the model performs well on the 'car', 'hk', and 'jvc' sequences, while demonstrating suboptimal results on 'hdclub' and 'hitachi_isee'. This discrepancy might be associated with the clarity of the original footage. Therefore, considering the quality of the original video dataset is crucial when applying this model.
In summary, all three models exhibit competent performance in super-resolution tasks. DRRN and DBRRN models offer a promising compromise between accuracy and computational efficiency, making them ideal for applications with limited computational resources. Further investigation into these models may yield ways to enhance their performance.
In order to compare the differences in user experience between the three models, we invited more than thirty people to view videos processed by the different models. Table 3 shows that users scored the DBRRN results higher: its average score is roughly 0.1 above that of the RNN, although the subjective visual difference is small.

5. Conclusions

The field of video super-resolution has gained substantial attention from researchers and industrial professionals due to its significant impact on various applications. To ensure fair evaluation, we utilized the Vimeo-90k dataset to train all models in this study and employed consistent downsampling filters and loss functions.
In this research, we systematically explored different temporal modeling techniques for video super-resolution tasks while maintaining the L1 loss function and consistent training data. Our primary focus was on incorporating 3D residual connections into hidden states and employing depthwise separable convolution in recurrent networks.
Our notable contribution is a Deep Residual Recursive Network (DRRN) optimized for video super-resolution. The network's design, particularly the distinct structure of the hidden state, facilitates the transfer of detailed image information between layers through an identity branch. This approach enhances the representation of detailed image features and addresses the common challenge of vanishing gradients during the training of Recurrent Neural Networks.
To enhance the performance of DRRN, we proposed the dual-channel DBRRN. Experimental results demonstrated that the DBRRN model outperformed the DRRN model, showing a clear improvement in video super-resolution performance.
This study paves the way for more advanced video super-resolution models that strike a balance between computational efficiency and performance. Future research could concentrate on further optimizing and evaluating these models across diverse settings and applications.

Author Contributions

Conceptualization, X.T. and F.O.; Methodology, Y.X.; Validation, Y.X.; Formal analysis, X.T.; Writing—original draft, Y.X.; Writing—review & editing, X.T. and F.O.; Project administration, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Academy of Broadcasting Science (No. JBKY20230160).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
2. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873.
3. Liu, C.; Sun, D. On Bayesian Adaptive Video Super Resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 346–360.
4. Ma, Z.; Liao, R.; Tao, X.; Xu, L.; Jia, J.; Wu, E. Handling Motion Blur in Multi-Frame Super-Resolution. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5224–5232.
5. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 184–199.
6. Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407.
7. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
8. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
9. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140.
10. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481.
11. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 294–310.
12. Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5406–5415.
13. Liao, R.; Tao, X.; Li, R.; Ma, Z.; Jia, J. Video Super-Resolution via Deep Draft-Ensemble Learning. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 531–539.
14. Kappeler, A.; Yoo, S.; Dai, Q.; Katsaggelos, A.K. Video Super-Resolution With Convolutional Neural Networks. IEEE Trans. Comput. Imaging 2016, 2, 109–122.
15. Caballero, J.; Ledig, C.; Aitken, A.; Totz, J.; Wang, Z.; Shi, W. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2848–2857.
16. Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3224–3232.
17. Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-Revealing Deep Video Super-Resolution. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4482–4490.
18. Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3357–3366.
19. Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent Back-Projection Network for Video Super-Resolution. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–18 June 2019; pp. 3892–3901.
20. Wang, X.; Chan, K.C.; Yu, K.; Dong, C.; Change Loy, C. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1954–1963.
21. Chen, J.; Tan, X.; Shan, C.; Liu, S.; Chen, Z. VESR-Net: The Winning Solution to Youku Video Enhancement and Super-Resolution Challenge. arXiv 2020, arXiv:2003.02115.
22. Chan, K.C.K.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Understanding Deformable Alignment in Video Super-Resolution. AAAI Conf. Artif. Intell. 2020, 35, 973–981.
23. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114.
24. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the European Conference on Computer Vision Workshops, Munich, Germany, 8–14 September 2018; pp. 63–79.
25. Lucas, A.; Lopez-Tapia, S.; Molina, R.; Katsaggelos, A.K. Generative Adversarial Networks and Perceptual Losses for Video Super-Resolution. IEEE Trans. Image Process. 2019, 28, 3312–3327.
26. Chu, M.; Xie, Y.; Mayer, J.; Leal-Taixé, L.; Thuerey, N. Learning Temporal Coherence via Self-Supervision for GAN-Based Video Generation. ACM Trans. Graph. 2018, 39, 75.
27. Guo, X.; Tu, Z.; Li, G.; Shen, Z.; Wu, W. A Novel Lightweight Multi-Dimension Feature Fusion Network for Single-Image Super-Resolution Reconstruction. Vis. Comput. Sci. 2023, 26, 1–12.
28. Tian, C.W.; Xu, Y.; Zuo, W.M.; Lin, C.W.; Zhang, D. Asymmetric CNN for Image Superresolution. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 3718–3730.
29. Zhu, F.; Jia, X.; Wang, S. Revisiting Temporal Modeling for Video Super-Resolution. arXiv 2020, arXiv:2008.05765.
30. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
32. Xiang, X.; Tian, Y.; Zhang, Y.; Fu, Y.; Allebach, J.P.; Xu, C. Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
33. Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive Fusion Video Super-Resolution Network via Exploiting Non-Local Spatio-Temporal Correlations. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
Figure 1. Schematic illustration of three commonly used temporal modeling frameworks: (a) the RNN, (b) the DRRN, and (c) the DBRRN.
Figure 2. The structure of Depthwise Separable Convolution.
Figure 3. Qualitative comparison of the SPMCS datasets.
Figure 4. The PSNR and Loss of the three models on Vid4. (a) PSNR, (b) loss.
Table 1. Comparative analysis of the three models: PSNR values on the Vid4, SPMCS, and UDM10 datasets, together with parameter counts, FLOPs, and runtime. The (Y) symbol indicates evaluation on the luminance channel.
Method        RNN        DRRN       DBRRN
Input Frames  recurrent  recurrent  recurrent
Param. [M]    7.204      2.93       2.94
FLOPs [GMAC]  193        108        120
Runtime [ms]  45         30         32
Vid4 (Y)      27.69      26.78      27.01
SPMCS (Y)     29.89      28.89      29.10
UDM10 (Y)     30.33      29.55      30.01
Table 2. Experimental results of DBRRN on the SPMCS and UDM10 datasets.
        car     hdclub   hitachi_isee   hk      jvc
SSIM    0.80    0.58     0.69           0.80    0.82
PSNR    28.05   21.12    22.19          28.32   27.02
Time    0.0475  0.0467   0.048          0.0475  0.053
Table 3. Subjective evaluation on the three models.
Score         RNN    DRRN   DBRRN
car           2.6    2.6    2.7
hdclub        2.0    2.0    2.2
hitachi_isee  2.4    2.5    2.5
hk            2.6    2.6    2.7
jvc           2.7    2.7    2.8
Average       2.46   2.48   2.58
