Article

Real-Time Video Super-Resolution with Spatio-Temporal Modeling and Redundancy-Aware Inference

School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(18), 7880; https://doi.org/10.3390/s23187880
Submission received: 2 August 2023 / Revised: 3 September 2023 / Accepted: 12 September 2023 / Published: 14 September 2023
(This article belongs to the Special Issue Artificial Intelligence in Imaging Sensing and Processing)

Abstract

Video super-resolution aims to generate high-resolution frames from low-resolution counterparts. It can be regarded as a specialized application of image super-resolution, serving various purposes, such as video display and surveillance. This paper proposes a novel method for real-time video super-resolution. It effectively exploits spatial information by utilizing the capabilities of an image super-resolution model and leverages the temporal information inherent in videos. Specifically, the method incorporates a pre-trained image super-resolution network as its foundational framework, allowing it to leverage existing expertise for super-resolution. A fast temporal information aggregation module is presented to further aggregate temporal cues across frames. By using deformable convolution to align features of neighboring frames, this module takes advantage of inter-frame dependency. In addition, it employs hierarchical fast spatial offset feature extraction and channel attention-based temporal fusion. A redundancy-aware inference algorithm is developed to reduce computational redundancy by reusing intermediate features, achieving real-time inference speed. Extensive experiments on several benchmarks demonstrate that the proposed method reconstructs satisfactory results with strong quantitative performance and visual quality. Its real-time inference capability makes it suitable for real-world deployment.

1. Introduction

Video is a widely used multimedia format that combines image frames with audio. However, video quality is often limited by factors such as capture, storage, and transmission [1]. Video super-resolution (SR) techniques aim to reconstruct high-resolution (HR) frames from low-resolution (LR) counterparts. Similarly, image SR models focus on enhancing the resolution of LR images. Video SR can be seen as an extension of single-image SR that leverages temporal information in addition to the spatial information of LR frames. It has diverse applications in video display [2], video surveillance [3], and satellite imagery [4].
Recently, deep learning-based methods have shown promising performance in video SR tasks [2] and image SR tasks [5]. These video SR models can be categorized into two groups: (1) models without image SR techniques and (2) models incorporating image SR techniques. The first category has to explore alternative approaches to spatial information, such as estimating upsampling filters [6] or task-specific optical flow [7]. Although these methods achieve good performance, they have limited spatial information modeling capacity. In contrast, the second category benefits from image SR insights for spatial reconstruction [8,9,10]. However, these models only incorporate specific components from image SR models, which creates a barrier to fully harnessing the potential of well-trained parameters and leaves room for performance improvement. Different from existing video SR models [11,12] that only borrow specific components from an image SR model, the proposed method employs a full image SR model for better spatial feature extraction and SR reconstruction. Different from Kappeler et al. [1] and Bao et al. [13], the proposed method pre-trains only the image SR model.
Further, numerous video SR models [8,10,14,15,16] focus on performance improvement, while only a few models [11,17,18] take time consumption into account, and fewer still [19,20] are capable of real-time inference. However, real-time inference is important for online applications such as live display. Different from previous work [18] that prunes unimportant filters, the proposed redundancy-aware inference algorithm reduces time consumption while retaining all filters of the video SR model.
In this work, a novel video SR method is proposed to address these limitations. To exploit spatial information, the proposed method incorporates the architecture and well-trained weights of an image SR model as its foundational framework. A fast temporal information aggregation module is introduced to effectively leverage inter-frame dependency. Since moving objects appear at different positions in neighboring frames, deformable convolution [21] can effectively extract information from adjacent frames. Considering the differences among neighboring frames, the channel attention mechanism [22] can adaptively rescale important features, resulting in effective temporal aggregation. Furthermore, a redundancy-aware inference algorithm is developed to reduce repetitive feature extraction, allowing the proposed method to achieve real-time inference while providing high-quality SR results. Experiments on popular benchmarks show that the proposed method delivers solid quantitative performance and visual quality. On the one hand, the use of the pre-trained image SR model reduces the difficulty of training a video super-resolution model; on the other hand, it allows the remaining modules to focus on temporal information aggregation. The redundancy-aware inference algorithm significantly reduces inference latency, making the method suitable for applications that need live video SR reconstruction.
The main contributions of this paper are as follows: (1) A novel video SR model is proposed that fully incorporates a pre-trained image SR model, can be inferred in real time, and provides high-quality SR video frames, achieving a trade-off between accuracy and efficiency. (2) A fast temporal information aggregation module is introduced, in which deformable convolution is adopted to extract the information of moving objects and channel attention is employed to adaptively capture important information. (3) A redundancy-aware inference algorithm is developed for video SR; by avoiding repetitive feature extraction, the computational cost is significantly reduced.
The remainder of this paper is organized as follows: Section 2 discusses related works. Section 3 provides a detailed description of the network architecture and the redundancy-aware inference. Section 4 presents datasets, implementation details, experimental results, and analysis. Finally, Section 5 concludes this paper.

2. Related Works

2.1. Image Super-Resolution

The image SR problem is a typical ill-posed problem. In 2014, Dong et al. [23] were the first to introduce deep learning into this field. Since then, image SR methods have experienced noteworthy advancements [5]. In 2017, Lim et al. [24] proposed the representative EDSR, which made use of residual learning, eliminated unnecessary batch normalization, and expanded the number of parameters while ensuring stable training. To adaptively rescale features, Zhang et al. [22] developed the channel attention mechanism, which has been successfully employed in RCAN. In 2019, Hui et al. [25] presented IMDN, a lightweight model with a small memory footprint that yielded competitive accuracy and enabled quick inference. More recently, the Transformer, originally introduced in natural language processing [26], has been introduced into computer vision [27]. Consequently, the enhanced Swin Transformer [28] has been adopted in SwinIR [29]. By combining convolutional layers and Swin Transformer blocks, SwinIR captures both local and global dependencies simultaneously, achieving state-of-the-art performance.
In this study, IMDN [25] is employed as the foundational framework for the following reasons: a real-time video system must deliver at least 24 frames per second to ensure a seamless user experience, and IMDN [25] has proven capable of effectively leveraging spatial information for SR reconstruction with a lightweight design.

2.2. Video Super-Resolution

Recently, there has been growing interest in the video SR problem, leading to the proposal of numerous deep learning-based models [2]. Given the need to leverage both spatial and temporal information, effectively handling the input low-resolution (LR) frames becomes crucial. We categorize existing methods into the following groups.
The first category includes methods that utilize optical flow. These methods make use of optical flow to align neighboring frames or features. For instance, VESPCN [11] aligns neighboring frames in a coarse-to-fine manner, while TOF [7] learns a task-specific optical flow. Additionally, DRVSR [14] introduces a carefully designed SPMC layer to register pixels at high resolution, and Wang et al. [30] directly estimated HR optical flow from LR frames. BasicVSR [12] propagates neighboring features via the optical flow. Although these methods have demonstrated promising results, they suffer from high computational complexity. Moreover, inaccurate optical flow estimation can negatively impact the quality of SR results.
The second category contains methods based on 3D convolutions. Three-dimensional convolution is capable of extracting spatial and temporal information simultaneously from multiple input frames. For example, Kim et al. [31] applied 3D convolutions to capture spatio-temporal dependencies in an end-to-end manner, while DUF [6] incorporates 3D convolutions in densely connected blocks. Isobe et al. [32] fused information from neighboring frames using 3D convolutions, and Li et al. [17] proposed fast spatio-temporal residual blocks for reduced latency. The introduction of 3D convolutions alleviates the reliance on inaccurate optical flows and enables end-to-end training. However, the choice of the kernel size in 3D convolutions requires a trade-off between performance under large motion and computational cost.
The third category consists of methods employing deformable convolutions, which have gained popularity recently. Deformable convolutions were proposed in [21]. The learnable offset enables video SR models to capture objects with motion. For instance, Tian et al. [33] employed deformable convolutions to align neighboring frames, while D3Dnet [34] extends deformable convolutions from 2D to 3D for motion adaptivity and spatio-temporal information modeling. EDVR [8] introduces the Pyramid, Cascading, and Deformable convolutions module for neighboring feature alignment. Unlike optical flow-based methods, deformable convolution-based algorithms do not require optical flow estimation, thereby reducing computational costs and enabling end-to-end training.
In addition, there are attention-based approaches. These methods extract spatio-temporal information via various attention mechanisms. For example, Yi et al. [15] and Li et al. [16] adopted non-local attention. Xiao et al. [35] exploited the temporal difference attention. Wang et al. [36] and Xiao et al. [37] made use of deformable attention. Further, some studies [10,38] have employed self-attention mechanisms for video restoration. The attention mechanism can weigh different features according to the input. This allows a model to pay more attention to the key information, thereby improving its accuracy.
For better performance on video SR reconstruction, the proposed method incorporates both deformable convolution and channel attention. The proposed fast temporal information aggregation is achieved through two stages: spatial aggregation and subsequent temporal aggregation. In the spatial aggregation stage, the deformable convolution is employed to align neighboring features. In order to effectively aggregate information from neighboring video frames, channel attention is used. Further, both stages significantly contribute to reconstruction performance.

3. Method

3.1. Overall Architecture

The overall architecture of the proposed method is shown in Figure 1. It takes 2n + 1 LR frames as input, centered around the target frame to be reconstructed at t = 0, where 2n is the number of neighboring frames and t denotes the relative frame index. The model consists of three key components, i.e., the spatial feature extraction module, the fast temporal information aggregation module, and the upsampler module. The spatial feature extraction module is based on a pre-trained image SR model, IMDN [25]. The fast temporal information aggregation module aligns and fuses neighboring frame features to exploit inter-frame dependencies. Finally, the upsampler module upscales the fused spatio-temporal representation to generate the SR output frame.
Figure 2a illustrates the spatial feature extraction module, comprising three convolutional layers with varying kernel sizes and six information multi-distillation blocks (IMDB) from IMDN [25]. Conv-3 and Conv-1 refer to the convolutional layers with kernel sizes of three and one, respectively. Additionally, it incorporates global residual learning and hierarchical feature exploitation. It is the foundational framework of the proposed method and is responsible for capturing effective spatial details from input LR frames.
As shown in Figure 3, the IMDB consists of two parts. The first part contains four convolutional layers, the first three of which are each followed by a leaky ReLU and a channel split layer. The channel split layer divides the feature into two parts holding 1/4 and 3/4 of the input channels, respectively. The feature with 1/4 of the channels is fed to the concatenation, while the feature with 3/4 of the channels is processed by the following convolutional layers. After the concatenation, the second part applies contrast-aware channel attention, a more advanced channel attention module that takes not only the average value but also the standard deviation of each feature channel into consideration.
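For illustration, a minimal PyTorch sketch of such a distillation block and contrast-aware channel attention is given below; the layer widths, LeakyReLU slope, reduction ratio, and residual connection are assumptions for this sketch rather than the reference IMDN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastChannelAttention(nn.Module):
    """Channel attention using the per-channel mean plus standard deviation (contrast)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        stat = x.mean(dim=(2, 3), keepdim=True) + x.std(dim=(2, 3), keepdim=True)
        return x * self.gate(stat)

class IMDBlock(nn.Module):
    """Information multi-distillation block: three split-and-distill steps, concat, attention."""
    def __init__(self, channels=64):
        super().__init__()
        self.distilled = channels // 4            # 1/4 of the channels are kept at each split
        self.remaining = channels - self.distilled
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(self.remaining, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(self.remaining, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(self.remaining, self.distilled, 3, padding=1)
        self.cca = ContrastChannelAttention(channels)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        d1, r1 = torch.split(F.leaky_relu(self.conv1(x), 0.05), [self.distilled, self.remaining], dim=1)
        d2, r2 = torch.split(F.leaky_relu(self.conv2(r1), 0.05), [self.distilled, self.remaining], dim=1)
        d3, r3 = torch.split(F.leaky_relu(self.conv3(r2), 0.05), [self.distilled, self.remaining], dim=1)
        d4 = self.conv4(r3)
        out = self.cca(torch.cat([d1, d2, d3, d4], dim=1))  # concatenation of distilled features
        return self.fuse(out) + x                            # 1x1 fusion + residual connection
```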
The fast temporal information aggregation module is a key component that allows the model to leverage the inter-frame dependencies. It consists of two stages, i.e., spatial aggregation and temporal aggregation. The spatial aggregation stage gathers information about the same object and aligns it to the center frame. The subsequent temporal aggregation stage fuses information temporally. The details of this module are described in Section 3.2.
Figure 2b shows the upsampler module, the final component that converts the fused spatio-temporal features into SR output frames. It contains a convolutional layer and a sub-pixel layer. The convolutional layer adjusts the number of channels, and the sub-pixel layer then upscales the features to the target spatial resolution by rearranging elements from the channel dimension into the spatial dimensions.
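A minimal sketch of such an upsampler, assuming a ×4 scale factor and a three-channel output, is as follows:

```python
import torch.nn as nn

class Upsampler(nn.Module):
    """Convolution to adjust channels, then a sub-pixel (PixelShuffle) layer for upscaling."""
    def __init__(self, in_channels=64, out_channels=3, scale=4):
        super().__init__()
        # The convolution expands channels to out_channels * scale^2 ...
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, 3, padding=1)
        # ... and PixelShuffle rearranges them into the spatial dimensions.
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))
```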
The specific design allows the spatial feature extraction module to extract information in a manner consistent with an image SR model. Consequently, the parameters of the spatial feature extraction module and the upsampler module can be initialized with well-trained parameters from an image SR model. Leveraging the spatial information extraction abilities learned by the image SR model, the utilization of these well-trained parameters enables the proposed model to make more effective use of spatial information from LR frames. Further, the spatial feature extraction module and upsampler module can be easily replaced by any other image SR models.
Given $2n+1$ LR frames $I^{LR}_{t}$, the corresponding target HR frame at $t = 0$ is denoted as $I^{HR}$. The super-resolved frame at $t = 0$, $I^{SR}$, can be produced by

$$I^{SR} = \mathrm{Net}\left(I^{LR}_{t=-n}, \ldots, I^{LR}_{t=0}, \ldots, I^{LR}_{t=n}\right), \tag{1}$$

where $\mathrm{Net}(\cdot)$ represents the proposed model. As illustrated in Figure 1, there are three modules in the proposed model, i.e., the spatial feature extraction module, the fast temporal information aggregation module, and the upsampler module. The proposed model can be further given by:

$$F^{S}_{t} = \mathrm{FE}_{spatial}\left(I^{LR}_{t}\right), \tag{2}$$
$$F^{T} = \mathrm{FE}_{aggregation}\left(F^{S}_{t=-n}, \ldots, F^{S}_{t=0}, \ldots, F^{S}_{t=n}\right), \tag{3}$$
$$I^{SR} = \mathrm{U}\left(F^{T}\right), \tag{4}$$

where $\mathrm{FE}_{spatial}(\cdot)$, $\mathrm{FE}_{aggregation}(\cdot)$, and $\mathrm{U}(\cdot)$ denote the spatial feature extraction module, fast temporal information aggregation module, and upsampler module, respectively. To optimize memory usage, the parameters of $\mathrm{FE}_{spatial}(\cdot)$ are shared across inputs with different timestamps. The spatial feature and temporal aggregated feature are represented as $F^{S}_{t}$ and $F^{T}$, respectively. Following previous work [33], the mean square error (MSE) is applied as the loss function for parameter optimization. For a sample from the training set, the loss function of the proposed model is defined as:

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{Net}\left(I^{LR,i}_{t=-n}, \ldots, I^{LR,i}_{t=0}, \ldots, I^{LR,i}_{t=n}\right) - I^{HR,i} \right\|_{2}, \tag{5}$$

where $\Theta$ denotes the learnable parameters of the proposed model. Further, the L2 norm is $\|\cdot\|_{2}$. The index of the sample in a mini-batch is represented by $i$.
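Putting the three modules together, a simplified training step can be sketched as follows; the module classes (`SpatialFeatureExtractor`-style backbone, `FastTemporalAggregation`, `Upsampler`) and the tensor layout are assumptions about the interface, not the released code.

```python
import torch
import torch.nn as nn

class VideoSRNet(nn.Module):
    """Spatial extraction (shared across frames) -> temporal aggregation -> upsampling."""
    def __init__(self, spatial_extractor, temporal_aggregator, upsampler):
        super().__init__()
        self.spatial = spatial_extractor      # pre-trained image SR backbone (e.g., IMDN body)
        self.temporal = temporal_aggregator   # fast temporal information aggregation module
        self.upsampler = upsampler            # conv + sub-pixel upsampler

    def forward(self, lr_frames):
        # lr_frames: (B, 2n+1, C, H, W); the same spatial extractor is applied to every frame.
        feats = [self.spatial(lr_frames[:, t]) for t in range(lr_frames.size(1))]
        fused = self.temporal(feats)          # spatio-temporal feature F^T
        return self.upsampler(fused)          # super-resolved center frame I^SR

def training_step(model, optimizer, lr_frames, hr_center):
    """One optimization step with the MSE loss described above."""
    optimizer.zero_grad()
    sr = model(lr_frames)
    loss = nn.functional.mse_loss(sr, hr_center)
    loss.backward()
    optimizer.step()
    return loss.item()
```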

3.2. Fast Temporal Information Aggregation Module

Figure 4 illustrates the architecture of the proposed fast temporal information aggregation module. The fast temporal information aggregation module aligns and fuses spatial features from the $2n+1$ input frames to generate an enriched spatio-temporal feature. It has two stages, i.e., the spatial aggregation stage and the temporal aggregation stage. Thus, the fast temporal information aggregation module can be formulated as:

$$F^{A}_{t} = \mathrm{Aggregate}_{spatial}\left(F^{S}_{t}\right), \tag{6}$$
$$F^{T} = \mathrm{Aggregate}_{temporal}\left(F^{A}_{t=-n}, \ldots, F^{A}_{t=n}\right), \tag{7}$$

where $\mathrm{Aggregate}_{spatial}(\cdot)$ and $\mathrm{Aggregate}_{temporal}(\cdot)$ denote the spatial and temporal aggregation stages, respectively. The intermediate spatially aggregated feature is denoted as $F^{A}_{t}$, and $F^{T}$ represents the output of this module.
The spatial aggregation stage includes the fast spatial offset feature extraction (FSOFE), the spatial feature alignment, and the spatial feature refinement. The FSOFE is conducted on the spatial feature $F^{S}_{t}$ to obtain the spatial offset feature $F^{SO}_{t}$. Then, for the spatial feature alignment, the offset feature $F^{O}_{t}$ is estimated using a $3 \times 3$ convolutional layer. Following this, a deformable convolution is employed for alignment. Unlike the conventional deformable convolution, this variant incorporates additional features for offset estimation, utilizing $F^{S}_{t}$ for feature extraction and $F^{O}_{t}$ for offset information. Finally, another deformable convolution is applied to refine the results in the aligned feature $F^{A}_{t}$. Note that the spatial feature alignment and refinement are skipped for the center spatial feature. The spatial aggregation stage can be expressed as:

$$F^{SO}_{t} = \mathrm{FSOFE}\left(F^{S}_{t}\right), \tag{8}$$
$$F^{O}_{t \neq 0} = \mathrm{Conv}_{3 \times 3}\left(\mathrm{Concat}\left(F^{SO}_{t \neq 0}, F^{SO}_{t=0}\right)\right), \tag{9}$$
$$F^{A}_{t \neq 0} = \mathrm{DConv}\left(\mathrm{AlignDConv}\left(F^{S}_{t \neq 0}, F^{O}_{t \neq 0}\right)\right), \tag{10}$$
$$F^{A}_{t=0} = F^{S}_{t=0}, \tag{11}$$

where $\mathrm{FSOFE}(\cdot)$, $\mathrm{Concat}(\cdot)$, $\mathrm{Conv}_{3 \times 3}(\cdot)$, $\mathrm{AlignDConv}(\cdot)$, and $\mathrm{DConv}(\cdot)$ represent FSOFE, concatenation, convolution with a kernel size of 3, deformable convolution for alignment, and deformable convolution, respectively. The parameters of $\mathrm{FSOFE}(\cdot)$, $\mathrm{Conv}_{3 \times 3}(\cdot)$, $\mathrm{AlignDConv}(\cdot)$, and $\mathrm{DConv}(\cdot)$ are shared to optimize memory consumption. $t = 0$ and $t \neq 0$ denote the timestamps of the input center frame and its neighboring frames, respectively.
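As a concrete illustration of the alignment step, the sketch below uses `torchvision.ops.DeformConv2d`; the exact offset-channel layout and the way the refinement offsets are derived are simplifying assumptions rather than the implemented variant.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SpatialAlignment(nn.Module):
    """Aligns a neighboring spatial feature to the center frame via deformable convolution."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Estimates F^O_t from the concatenated offset features (Equation (9)).
        self.offset_conv = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        # Maps the offset feature to the 2*k*k sampling offsets required by DeformConv2d.
        self.to_offsets = nn.Conv2d(channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.align_dconv = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.refine_offsets = nn.Conv2d(channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.refine_dconv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, feat_t, offset_feat_t, offset_feat_center):
        # Offset feature from the neighbor and the center (Equation (9)).
        offset_feat = self.offset_conv(torch.cat([offset_feat_t, offset_feat_center], dim=1))
        # Alignment deformable convolution guided by the estimated offsets (Equation (10)).
        aligned = self.align_dconv(feat_t, self.to_offsets(offset_feat))
        # A second deformable convolution refines the aligned feature.
        return self.refine_dconv(aligned, self.refine_offsets(aligned))
```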
The FSOFE is responsible for extracting spatial offset features to guide the alignment performed by deformable convolution. As shown in Figure 5, it adopts a compact two-level hierarchical structure to extract offsets efficiently. In the first level, the spatial feature is extracted by a residual block from [24]. In the second level, a 3 × 3 convolution with stride 2 is applied to reduce the spatial dimensions. The features from these two levels are fused by an element-wise addition and two residual blocks. The output features contain useful offset cues extracted from the spatial features and provide guidance for the deformable convolution to adaptively aggregate and align the spatial features from neighboring frames. The two-level design allows the FSOFE to extract offset features with a large receptive field in an efficient manner.
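A sketch of this two-level structure is given below; the bilinear upsampling used to match resolutions before the element-wise addition is an assumption, since the exact resampling operator is not specified above.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """EDSR-style residual block (conv -> ReLU -> conv with a skip connection)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FSOFE(nn.Module):
    """Fast spatial offset feature extraction with a two-level hierarchy."""
    def __init__(self, channels=64):
        super().__init__()
        self.level1 = ResidualBlock(channels)                                 # full-resolution branch
        self.level2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)   # half-resolution branch
        self.fuse = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))

    def forward(self, feat):
        l1 = self.level1(feat)
        l2 = self.level2(l1)
        # Upsample the coarse branch back to the input resolution before fusing (assumed bilinear).
        l2_up = F.interpolate(l2, size=l1.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(l1 + l2_up)
```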
The temporal aggregation stage combines the 2n + 1 spatially aligned features $F^{A}_{t}$ to generate a spatio-temporal feature. In order to effectively aggregate useful information, a channel attention layer and RCAB [22] are employed: the channel attention adaptively rescales channels within a residual structure, while RCAB [22] extracts representative features for reconstruction. Further, a convolutional layer is placed between the channel attention layer and RCAB [22] to reduce the number of channels, resulting in lower inference latency. The optimal architecture of the temporal aggregation stage is provided in Table 1.
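The layer ordering of Table 1 can be rendered as the following sketch; the `ChannelAttention` and `RCAB` definitions follow the RCAN-style formulation [22], and the base channel width of 64 and the LeakyReLU slope are assumptions taken from the implementation details rather than the exact released layers.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """RCAN-style channel attention: global average pooling followed by a bottleneck gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class RCAB(nn.Module):
    """Residual channel attention block: conv -> ReLU -> conv -> channel attention + skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            ChannelAttention(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class TemporalAggregation(nn.Module):
    """Fuses 2n+1 aligned features following the layer ordering of Table 1."""
    def __init__(self, num_frames=5, c=64):
        super().__init__()
        self.att1 = ChannelAttention(num_frames * c)
        self.stage1 = nn.Sequential(
            nn.Conv2d(num_frames * c, 2 * c, 3, padding=1),   # (2n+1)x -> 2x channels
            RCAB(2 * c),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.att2 = ChannelAttention(2 * c)
        self.stage2 = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1),                # 2x -> 1x channels
            RCAB(c),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, aligned_feats):
        x = torch.cat(aligned_feats, dim=1)    # layer 1: concatenation
        x = x + self.att1(x)                   # layers 2-3: channel attention + element-wise add
        x = self.stage1(x)                     # layers 4-6: convolution, RCAB, LeakyReLU
        x = x + self.att2(x)                   # layers 7-8: channel attention + element-wise add
        return self.stage2(x)                  # layers 9-11: convolution, RCAB, LeakyReLU
```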
Motion among these frames provides valuable cues for reconstructing the center frame. The fast temporal information aggregation module generates a spatio-temporal feature that contains information from all input LR frames. Then, the spatio-temporal feature is upscaled to produce the SR result.

3.3. Redundancy-Aware Inference

In order to minimize the computational redundancy that arises during model inference, the redundancy-aware inference (RAI) algorithm is introduced. It relies on the fact that, once trained, the model parameters remain fixed. As stated in Equation (2), the spatial feature extraction module has to be applied to all neighboring LR frames. However, when inferring consecutive frames of a video, these repeated computations are redundant. This redundancy presents an opportunity to enhance efficiency and reduce inference latency.
In the standard inference process, which is consistent with the training phase, the spatial feature extraction module is executed $2n+1$ times to process each input frame separately. Thus, the latency for inferring a single frame can be expressed as follows:

$$(2n+1) \times L_{SFE} + L_{FTIA} + L_{U},$$

where $L_{SFE}$, $L_{FTIA}$, and $L_{U}$ are the inference latencies of the spatial feature extraction module, fast temporal information aggregation module, and upsampler module, respectively. However, this is redundant, as the operations and parameters are identical each time. Hence, some intermediate features, such as $F^{S}_{t}$, remain consistent when generating adjacent SR frames. The RAI reduces this redundancy by caching and reusing these intermediate features. For subsequent frames, the cached features from previous timestamps are reused instead of recomputing them. Only the features from the new input frame need to be extracted. As a result, the latency for inferring a frame after the first $n$ and before the last $n$ frames can be improved to:

$$L_{SFE} + L_{FTIA} + L_{U},$$

leading to a reduction in latency of $2n \times L_{SFE}$. Similarly, the output of FSOFE, as indicated in Equation (8), can be stored for further processing. Algorithm 1 provides the details of the RAI.
It is important to note that, for simplicity, the processing of the first n and last n frames is omitted. Due to the inconsistency in the processing at both ends, there is a performance degradation on these frames. However, in the proposed RAI, the spatial feature extraction module and the FSOFE are executed once instead of 2n + 1 times. This allows real-time performance to be achieved during inference without modifying the proposed model.
Algorithm 1: Redundancy-Aware Inference Algorithm for the Proposed Model.
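A minimal Python sketch of the caching scheme described above is given below; the cache layout, the module attribute names (`spatial`, `fsofe`, `aggregate`, `upsample`), and the handling of the first and last n frames are simplifying assumptions rather than the exact steps of Algorithm 1.

```python
from collections import deque
import torch

@torch.no_grad()
def redundancy_aware_inference(model, lr_frames, n=2):
    """Super-resolve a frame sequence while caching per-frame spatial and offset features.

    lr_frames: list of LR tensors, each of shape (1, C, H, W).
    model is assumed to expose spatial(), fsofe(), aggregate(), and upsample() submodules.
    """
    spatial_cache = deque(maxlen=2 * n + 1)   # cached spatial features F^S_t
    offset_cache = deque(maxlen=2 * n + 1)    # cached FSOFE outputs F^SO_t
    results = []

    for frame in lr_frames:
        # The spatial feature extraction module and FSOFE run once per new frame only.
        feat = model.spatial(frame)
        spatial_cache.append(feat)
        offset_cache.append(model.fsofe(feat))

        # Once 2n+1 frames are cached, reconstruct the center frame of the window.
        if len(spatial_cache) == 2 * n + 1:
            fused = model.aggregate(list(spatial_cache), list(offset_cache))
            results.append(model.upsample(fused))

    return results
```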

4. Experiments

4.1. Dataset

In the experiments, Vimeo90K [7] is utilized for training. This dataset contains 64,612 video sequences for training, each composed of seven frames. The Vimeo90K dataset has been widely acknowledged and used in various video-related tasks, such as video SR and video interpolation. To evaluate the performance of the proposed model, two well-known benchmarks are employed: Vid4 [33] and SPMCs-30 [14]. The Vid4 benchmark consists of 4 videos with a total of 171 frames, with a minimum frame resolution of 720 × 480. In addition to Vid4, the proposed method is evaluated on the SPMCs-30 benchmark, which consists of 30 videos, each containing 31 frames. The resolution of the video frames in SPMCs-30 is 960 × 540.

4.2. Implementation Details

To generate LR frames, bicubic degradation was applied via the Matlab function imresize, with a downsampling scale factor of four. During the training phase, the patch size of the ground truth (GT) and the mini-batch size were empirically set to 256 and 16, respectively. To capture temporal information, the number of neighboring frames was empirically set to two, so the model takes five LR frames as input. Additionally, data augmentation techniques, such as random flipping and rotation, were applied to the training data. The Adam optimizer [39] was utilized to optimize the proposed method, with parameters β1 = 0.9 and β2 = 0.99. The learning rate was initialized to 1 × 10⁻⁴ and gradually decayed to 1 × 10⁻⁷. The training process lasted for 300,000 iterations. The channel number of the proposed model was empirically set to 64, except for the cases shown in Table 1. All experiments were conducted on a server with Python 3.8, PyTorch 1.12, an Intel CPU, and an Nvidia 2080Ti GPU.
To initialize the weights of the proposed method, the spatial feature extraction module and the upsampler module load the weights of the pre-trained foundational framework, IMDN, while the other parameters are initialized with PyTorch defaults. No parameters are frozen when training the proposed method. The training of IMDN is consistent with [25]: the training set is DIV2K [40], bicubic degradation is adopted to generate LR images, the channel number is set to 64, and the batch size is 16.
The performance of the reconstructed frames is assessed by two widely adopted metrics: peak signal-to-noise ratio (PSNR) and structure similarity index (SSIM) [41]. The PSNR of one SR frame is defined as:
$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}},$$

and the mean squared error (MSE) is defined as:

$$\mathrm{MSE} = \frac{1}{P} \sum_{p=1}^{P} \left( I^{SR}(p) - I^{HR}(p) \right)^2,$$

where $P$ represents the total number of pixels in a frame. $I^{SR}$ and $I^{HR}$ denote the SR frame result and HR frame reference, respectively. Further, SSIM is defined as:

$$\mathrm{SSIM}\left(I^{SR}, I^{HR}\right) = \frac{2 \mu_{I^{SR}} \mu_{I^{HR}} + k_1}{\mu_{I^{SR}}^2 + \mu_{I^{HR}}^2 + k_1} \cdot \frac{2 \sigma_{I^{SR} I^{HR}} + k_2}{\sigma_{I^{SR}}^2 + \sigma_{I^{HR}}^2 + k_2},$$

where $\mu_{I^{SR}}$ and $\mu_{I^{HR}}$ are the mean values of the SR and HR frames, respectively. $\sigma_{I^{SR}}$ and $\sigma_{I^{HR}}$ are the standard deviations of the SR and HR frames, respectively. $k_1$ and $k_2$ are used to stabilize the calculation and are set to 0.01 and 0.03, respectively. The covariance of the SR and HR frames is denoted as $\sigma_{I^{SR} I^{HR}}$. Following previous studies [7,19,20,33], these metrics are calculated on the luminance channel (Y channel of the YCbCr color space), while cropping the eight pixels near the boundary. Note that all frames were considered for performance evaluation.
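For reference, the PSNR computation on the Y channel with an eight-pixel border crop can be sketched as follows; the BT.601 luminance coefficients are assumed, and SSIM is omitted for brevity.

```python
import numpy as np

def rgb_to_y(img):
    """Convert an 8-bit RGB image (H, W, 3) to the Y channel of YCbCr (BT.601)."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr, hr, border=8):
    """PSNR between SR and HR frames on the luminance channel, cropping the border pixels."""
    y_sr = rgb_to_y(sr)[border:-border, border:-border]
    y_hr = rgb_to_y(hr)[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```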

4.3. Comparisons

To examine the performance of our model, comparisons with one image SR method (IMDN [25]) and six video SR methods (SWRN [19], 3DSRnet [31], TOF [7], EGVSR [20], SOFVSR [30], and RISTN [42]) are conducted. IMDN [25] is a lightweight image SR model and is employed as the foundational framework of the proposed method. SWRN [19] is a recent lightweight video SR method. 3DSRnet [31] exploits spatio-temporal information via 3D convolution. TOF [7] focuses on estimating task-specific optical flow in videos. EGVSR [20] is a generative adversarial network-based model, and SOFVSR [30] predicts HR optical flow to enhance video SR results. RISTN [42] leverages temporal features in a recurrent scheme.
First, the proposed method is evaluated on the Vid4 benchmark. The quantitative results are presented in Table 2 and Figure 6a, where each cell reports PSNR (dB) and SSIM. The results on the Vid4 benchmark demonstrate that our method outperforms the others in terms of overall performance. Compared with the foundational IMDN [25], the proposed method improves PSNR and SSIM by 1.06 dB and 0.057, respectively. The proposed method also surpasses the lightweight VSR method SWRN [19], leading by 1.34 dB in PSNR.
In addition, the proposed method is superior to TOF [7] and SOFVSR [30], which are optical flow-based VSR methods. Further, the performance of the recurrent-based RISTN [42] is lower than that of the proposed approach. When compared with the GAN-based EGVSR [20], the proposed method underperforms on the Calendar and City videos but outperforms on the Foliage and Walk videos. On average, the PSNR value of the proposed method is 0.44 dB higher than that of EGVSR [20], while the SSIM value is 0.005 lower. Overall, the proposed method demonstrates better performance due to its utilization of an image SR model, which excels at exploiting spatial information, and the proposed fast temporal information aggregation module, which effectively leverages information from neighboring frames. Importantly, the inclusion of the proposed RAI barely affects performance, with only a slight degradation of 0.0093 dB in PSNR and 0.0007 in SSIM.
For a qualitative comparison, the proposed method is compared with IMDN [25], SWRN [19], TOF [7], and SOFVSR [30]. As shown in Figure 7, frames from each video are presented, arranged from top to bottom as follows: Calendar, City, Foliage, and Walk. The first column shows the whole frame; the second column, labeled GT, shows the ground-truth reference of the compared patch; and the third through seventh columns show the results of the different methods, each annotated with its PSNR. Notably, the proposed model delivers superior performance in enhancing text clarity in Calendar and sharpening the car's boundaries in Foliage. This can be attributed to our model's utilization of an image SR model as its foundational framework, which provides the capacity to effectively extract and utilize spatial information. Additionally, the proposed method reconstructs clear building textures in City, and in Walk, the rope on the clothes is significantly more recognizable. In both of these scenarios, the aggregation of temporal information plays an important role in achieving the improved results.
In addition to the Vid4 benchmark, comparisons on the SPMCs-30 [14] benchmark are conducted. The quantitative results are presented in Table 3 and Figure 6b. On the SPMCs-30 benchmark, the proposed method surpasses all others in terms of average PSNR and SSIM. Specifically, our method exhibits a remarkable improvement of 1.5 dB and 4.3% over SWRN [19] in terms of average PSNR and SSIM, respectively. Compared with the optical flow-based methods TOF [7] and SOFVSR [30], the proposed method leads by a margin of 0.8 dB in terms of PSNR. Further, the recurrent-based RISTN [42] underperforms the proposed method by 0.58 dB and 0.012 in terms of PSNR and SSIM. Thus, the proposed method makes better use of neighboring information than the recurrent scheme in RISTN [42].
The qualitative comparison is shown in Figure 8, where frames from six videos have been selected for analysis. Arranged from top to bottom, the videos are: AMVTG_004, hdclub_001, hdclub_003, hitachi_isee5, jvc_004, and LDVTG_009. The GT column is the high-resolution reference. In the case of AMVTG_004, it is evident that all compared models struggle to accurately reproduce the texture of the wall, and some methods introduce undesired artifacts. Similarly, in hdclub_001, only the proposed method and SWRN succeed in recovering the correct structure by effectively leveraging temporal information from neighboring frames. All compared methods exhibit poor performance in hdclub_003; however, the proposed method reconstructs a clear and well-defined structure for both the building and the flower in hdclub_003 and hitachi_isee5. The results obtained from jvc_004 show the ability of the proposed method to recover more details. Lastly, the SR frames of LDVTG_009 illustrate how the proposed method effectively utilizes the capabilities of the image SR model, leading to improved results. These qualitative comparisons serve as compelling evidence of the superior performance and effectiveness of the proposed method.
The temporal consistency of the proposed model is evaluated following the methodology in a prior study [33]. The temporal profiles of different methods are shown in Figure 9, with each temporal profile generated at the specified location marked in red, as illustrated in the first column. The reference temporal profile of high-resolution video frames is shown in the GT column. As one can see, the proposed model exhibits superior performance in terms of generating smooth and clearly defined temporal profiles, particularly in Calendar and City. While artifacts are present in the temporal profile of Walk for all methods, the proposed approach demonstrates the fewest instances of such artifacts, indicating its ability to effectively preserve temporal consistency. These findings serve as robust evidence of the enhanced temporal performance of our method.

4.4. Efficiency

The efficiency is analyzed from four aspects: number of parameters, number of computational operations, inference latency, and quality of SR results. The floating point operations (FLOPs) and latency of each model are evaluated by producing 100 SR frames with a resolution of 1280 × 720, and all models are run on an Nvidia 2080Ti GPU. The efficiency of the proposed method and the compared models is presented in Table 4 and Figure 10. As shown in Table 4, four models are capable of real-time inference. The numbers of parameters of IMDN [25] and SWRN [19] are relatively small, and their low computational complexity enables real-time inference; however, their PSNR performance is slightly lower than that of the other methods. TOF [7] and SOFVSR [30] need additional time for optical flow estimation, so they cannot achieve real-time inference. EGVSR [20] has more parameters than the proposed method. The proposed method performs well in terms of parameter count and PSNR, but without RAI it cannot achieve real-time inference due to computational redundancy. With the integration of the RAI, both latency and FLOPs drop significantly, enabling the proposed method to produce 720P SR frames in real time while still achieving competitive performance. These results indicate that the RAI is a simple yet effective strategy to optimize the inference process by avoiding unnecessary computations, achieving a balance between effectiveness and efficiency. Further, its modular design allows it to be integrated into other video models that require spatio-temporal feature extraction.
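Latency figures of this kind are typically obtained with a measurement loop such as the sketch below; the warm-up iterations, the assumed 5-frame 320 × 180 LR input for ×4 720p output, and the model interface follow the earlier sketches, while the numbers in Table 4 come from the authors' own setup.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, num_frames=100, height=180, width=320, n=2, device="cuda"):
    """Average per-frame latency (ms) for producing 720p SR frames from (2n+1) LR inputs."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 2 * n + 1, 3, height, width, device=device)  # LR input window

    for _ in range(10):                 # warm-up so timings exclude CUDA initialization
        model(dummy)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_frames):
        model(dummy)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / num_frames
```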

4.5. Ablation Analysis

In this section, ablation studies are presented to examine the impact of the key components. IMDN, which takes a single LR frame as input, establishes the baseline for comparison. Subsequently, the spatial aggregation and temporal aggregation, the key stages of the fast temporal information aggregation module, are evaluated. To measure the performance of the model with spatial aggregation only, the spatially aggregated features are fused using concatenation and a 3 × 3 convolutional layer. Table 5 provides the ablation studies of the proposed model, with the second and third columns specifically highlighting each variation.
On the Vid4 benchmark, the baseline model without temporal information achieves a PSNR result of 25.3254 dB and an SSIM result of 72.49%. By incorporating spatial aggregation, there is a noticeable improvement of 0.6499 dB and 3.96% in terms of PSNR and SSIM. Notably, the temporal aggregation in this variation is a simple 3 × 3 convolution. When the proposed temporal aggregation approach is employed, there is a further increase in performance, with an additional enhancement of 0.315 dB and 1.63% in terms of PSNR and SSIM, respectively. These results validate the significant contributions of both spatial and temporal aggregation components within our method.
Furthermore, an additional analysis is conducted to evaluate the impact of well-trained parameters from the image SR model on the video SR task. As shown in Table 5, the fourth column indicates whether the model was initialized with well-trained image SR parameters. The results demonstrate the significance of utilizing well-trained parameters in the video SR task. Model 4 exhibits superior performance compared to Model 1, while Model 5 outperforms Model 3. These findings suggest that incorporating well-trained parameters from an image SR model can effectively enhance the overall performance of the video SR task. This analysis further emphasizes the importance of leveraging existing knowledge and expertise in the field of image SR to improve the efficiency and effectiveness of video SR models.

4.6. Limitation

Although the proposed method can infer 720P video frames in real time, there are some limitations. First, the LR video frames are synthesized by bicubic degradation, which may deviate from the degradation of actual low-resolution video. Secondly, the performance of the proposed method can be further improved: although it achieves the overall best performance in Section 4.3, it performs worse than IMDN on some videos, and some reconstruction results are not very sharp, for example, "Sunday" and "Monday" in Figure 7. Thirdly, the inference time is close to the real-time boundary, so there is still room for improvement.

5. Conclusions

In this paper, a novel approach for real-time video super-resolution is presented. The method incorporates a pre-trained image super-resolution model as its foundational framework to effectively exploit spatial information. To further leverage inter-frame dependencies, a fast temporal information aggregation module is introduced with the utilization of deformable convolution. This temporal modeling extracts motion cues across frames to enrich the spatial details. Additionally, a redundancy-aware inference algorithm is developed to minimize redundant computations by reusing intermediate features. It reduces the inference latency, enabling real-time performance for 720p video super-resolution with a minimal impact on accuracy. Experiments on several benchmarks show that the proposed method produces high-quality SR results both quantitatively and qualitatively. The real-time inference capability makes the proposed method suitable for practical applications requiring live video enhancement. In the future, efficient video super-resolution approaches could be improved along directions including, but not limited to, the following: advanced degradation models for real-world low-resolution video; attention mechanisms for better spatial-temporal feature extraction; and novel techniques for efficient inference.

Author Contributions

Conceptualization, W.W.; Methodology, W.W. and Z.L.; Software, W.W. and Z.Z.; Validation, W.W. and H.L.; Formal analysis, Z.L. and R.L.; Investigation, W.W., H.L. and Z.Z.; Resources, Z.L. and R.L.; Data curation, W.W., H.L. and Z.Z.; Writing—original draft preparation, W.W. and H.L.; Writing—review and editing, Z.L., R.L., H.L., Z.Z. and W.W.; Visualization, H.L., Z.Z. and W.W.; Supervision, Z.L.; Project administration, W.W. and H.L.; Funding acquisition, Z.L. and R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (82272075, 61866009 and 62172120), Guangxi Key Research and Development Program (AB21220037 and ZY20198016), and Innovation Project of Guangxi Graduate Education (YCBZ2022112).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The public data used in this work are listed here: Vimeo90k (toflow.csail.mit.edu) http://toflow.csail.mit.edu/index.html#septuplet (accessed on 12 December 2022), Vid4 (Google Drive) https://drive.google.com/file/d/1ZuvNNLgR85TV_whJoHM7uVb-XW1y70DW/view?usp=sharing (accessed on 12 December 2022), and SPMCs-30 (GitHub) https://github.com/jiangsutx/SPMC_VideoSR (accessed on 12 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kappeler, A.; Yoo, S.; Dai, Q.; Katsaggelos, A.K. Video Super-Resolution With Convolutional Neural Networks. IEEE Trans. Comput. Imaging 2016, 2, 109–122. [Google Scholar] [CrossRef]
  2. Rota, C.; Buzzelli, M.; Bianco, S.; Schettini, R. Video restoration based on deep learning: A comprehensive survey. Artif. Intell. Rev. 2023, 56, 5317–5364. [Google Scholar] [CrossRef]
  3. Farooq, M.; Dailey, M.N.; Mahmood, A.; Moonrinta, J.; Ekpanyapong, M. Human face super-resolution on poor quality surveillance video footage. Neural Comput. Appl. 2021, 33, 13505–13523. [Google Scholar] [CrossRef]
  4. Xiao, Y.; Su, X.; Yuan, Q.; Liu, D.; Shen, H.; Zhang, L. Satellite Video Super-Resolution via Multiscale Deformable Convolution Alignment and Temporal Grouping Projection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  5. Anwar, S.; Khan, S.H.; Barnes, N. A Deep Journey into Super-resolution: A Survey. ACM Comput. Surv. 2020, 53, 60. [Google Scholar] [CrossRef]
  6. Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters without Explicit Motion Compensation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; Computer Vision Foundation/IEEE Computer Society: Washington, DC, USA, 2018; pp. 3224–3232. [Google Scholar] [CrossRef]
  7. Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
  8. Wang, X.; Chan, K.C.K.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Washington, DC, USA, 2019; pp. 1954–1963. [Google Scholar] [CrossRef]
  9. Choi, Y.J.; Lee, Y.; Kim, B. Wavelet Attention Embedding Networks for Video Super-Resolution. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Milan, Italy, 10–15 January 2021; IEEE: Washington, DC, USA, 2020; pp. 7314–7320. [Google Scholar] [CrossRef]
  10. Liang, J.; Fan, Y.; Xiang, X.; Ranjan, R.; Ilg, E.; Green, S.; Cao, J.; Zhang, K.; Timofte, R.; Gool, L.V. Recurrent Video Restoration Transformer with Guided Deformable Attention. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 378–393. [Google Scholar]
  11. Caballero, J.; Ledig, C.; Aitken, A.P.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 2848–2857. [Google Scholar] [CrossRef]
  12. Chan, K.C.K.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, 19–25 June 2021; Computer Vision Foundation/IEEE: Washington, DC, USA, 2021; pp. 4947–4956. [Google Scholar] [CrossRef]
  13. Bao, W.; Lai, W.; Zhang, X.; Gao, Z.; Yang, M. MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 933–948. [Google Scholar] [CrossRef] [PubMed]
  14. Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-Revealing Deep Video Super-Resolution. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 4482–4490. [Google Scholar] [CrossRef]
  15. Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive Fusion Video Super-Resolution Network via Exploiting Non-Local Spatio-Temporal Correlations. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Washington, DC, USA, 2019; pp. 3106–3115. [Google Scholar] [CrossRef]
  16. Li, W.; Tao, X.; Guo, T.; Qi, L.; Lu, J.; Jia, J. MuCAN: Multi-correspondence Aggregation Network for Video Super-Resolution. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Springer: Cham, Switzerland, 2020; Volume 12355, pp. 335–351. [Google Scholar] [CrossRef]
  17. Li, S.; He, F.; Du, B.; Zhang, L.; Xu, Y.; Tao, D. Fast Spatio-Temporal Residual Network for Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; Computer Vision Foundation/IEEE: Washington, DC, USA, 2019; pp. 10522–10531. [Google Scholar] [CrossRef]
  18. Xia, B.; He, J.; Zhang, Y.; Wang, Y.; Tian, Y.; Yang, W.; Van Gool, L. Structured Sparsity Learning for Efficient Video Super-Resolution. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22638–22647. [Google Scholar] [CrossRef]
  19. Lian, W.; Lian, W. Sliding Window Recurrent Network for Efficient Video Super-Resolution. In Proceedings of the Computer Vision—ECCV 2022 Workshops—Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part II; Lecture Notes in Computer Science; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2022; Volume 13802, pp. 591–601. [Google Scholar] [CrossRef]
  20. Cao, Y.; Wang, C.; Song, C.; Tang, Y.; Li, H. Real-Time Super-Resolution System of 4K-Video Based on Deep Learning. In Proceedings of the 32nd IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2021, Virtual Conference, 7–9 July 2021; IEEE: Washington, DC, USA, 2021; pp. 69–76. [Google Scholar] [CrossRef]
  21. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 764–773. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 294–310. [Google Scholar] [CrossRef]
  23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  24. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  25. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight Image Super-Resolution with Information Multi-distillation Network. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, 21–25 October 2019; Amsaleg, L., Huet, B., Larson, M.A., Gravier, G., Hung, H., Ngo, C., Ooi, W.T., Eds.; ACM: New York, NY, USA, 2019; pp. 2024–2032. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  28. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; IEEE: Washington, DC, USA, 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  29. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; IEEE: Washington, DC, USA, 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  30. Wang, L.; Guo, Y.; Liu, L.; Lin, Z.; Deng, X.; An, W. Deep Video Super-Resolution Using HR Optical Flow Estimation. IEEE Trans. Image Process. 2020, 29, 4323–4336. [Google Scholar] [CrossRef] [PubMed]
  31. Kim, S.Y.; Lim, J.; Na, T.; Kim, M. Video Super-Resolution Based on 3D-CNNS with Consideration of Scene Change. In Proceedings of the 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, 22–25 September 2019; IEEE: Washington, DC, USA, 2019; pp. 2831–2835. [Google Scholar] [CrossRef]
  32. Isobe, T.; Li, S.; Jia, X.; Yuan, S.; Slabaugh, G.G.; Xu, C.; Li, Y.; Wang, S.; Tian, Q. Video Super-Resolution With Temporal Group Attention. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE: Washington, DC, USA, 2020; pp. 8005–8014. [Google Scholar] [CrossRef]
  33. Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE: Washington, DC, USA, 2020; pp. 3357–3366. [Google Scholar] [CrossRef]
  34. Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3D Convolution for Video Super-Resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504. [Google Scholar] [CrossRef]
  35. Xiao, Y.; Yuan, Q.; Jiang, K.; Jin, X.; He, J.; Zhang, L.; Lin, C. Local-Global Temporal Difference Learning for Satellite Video Super-Resolution. arXiv 2023, arXiv:2304.04421. [Google Scholar] [CrossRef]
  36. Wang, H.; Xiang, X.; Tian, Y.; Yang, W.; Liao, Q. STDAN: Deformable Attention Network for Space-Time Video Super-Resolution. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–11. [Google Scholar] [CrossRef] [PubMed]
  37. Xiao, Y.; Yuan, Q.; Zhang, Q.; Zhang, L. Deep Blind Super-Resolution for Satellite Video. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5516316. [Google Scholar] [CrossRef]
  38. Shi, S.; Gu, J.; Xie, L.; Wang, X.; Yang, Y.; Dong, C. Rethinking Alignment in Video Super-Resolution Transformers. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 36081–36093. [Google Scholar]
  39. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations—ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  40. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  41. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  42. Zhu, X.; Li, Z.; Zhang, X.Y.; Li, C.; Liu, Y.; Xue, Z. Residual Invertible Spatio-Temporal Network for Video Super-Resolution. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5981–5988. [Google Scholar] [CrossRef]
Figure 1. Overall Architecture of the Proposed Method.
Figure 2. Details of the Spatial Feature Extraction Module and Upsampler Module.
Figure 3. Details of Information Multi-Distillation Blocks Module.
Figure 4. Architecture of the Proposed Fast Temporal Information Aggregation Module.
Figure 5. Architecture of Proposed Fast Spatial Offset Feature Extraction.
Figure 6. Quantitative Comparison on the Vid4 and SPMCs-30 Benchmarks.
Figure 7. Qualitative Comparisons on the Vid4 Benchmark.
Figure 8. Qualitative Comparisons on the SPMCs-30 Benchmark.
Figure 9. Qualitative Comparisons of Temporal Profiles.
Figure 10. Latency and PSNR on the Vid4 Benchmark.
Table 1. Details of the Temporal Aggregation Stage. (Channel counts are given as multiples of the base feature width.)

| Layer No. | Input Layer No. | Layer | Input Channels | Output Channels |
|---|---|---|---|---|
| 0 | – | Input | – | 1× |
| 1 | 0 | Concatenation | (2n+1)× | (2n+1)× |
| 2 | 1 | Channel Attention | (2n+1)× | (2n+1)× |
| 3 | 1 and 2 | Elementwise Add | (2n+1)× | (2n+1)× |
| 4 | 3 | Convolution 3×3 | (2n+1)× | 2× |
| 5 | 4 | RCAB | 2× | 2× |
| 6 | 5 | LeakyReLU | 2× | 2× |
| 7 | 6 | Channel Attention | 2× | 2× |
| 8 | 6 and 7 | Elementwise Add | 2× | 2× |
| 9 | 8 | Convolution 3×3 | 2× | 1× |
| 10 | 9 | RCAB | 1× | 1× |
| 11 | 10 | LeakyReLU | 1× | 1× |
| 12 | 11 | Output | 1× | – |
Table 2. Quantitative Comparison on the Vid4 Benchmark. Each cell reports PSNR (dB)/SSIM; the best and second-best results are marked in red and blue, respectively.

| Method | Calendar | City | Foliage | Walk | Average |
|---|---|---|---|---|---|
| IMDN [25] | 22.1185/0.7078 | 25.9733/0.6811 | 24.6737/0.6564 | 28.4131/0.8693 | 25.2947/0.7287 |
| SWRN [19] | 21.7028/0.6749 | 25.8976/0.6731 | 24.4783/0.6458 | 27.9725/0.8585 | 25.0128/0.7131 |
| 3DSRnet [31] | 22.5174/0.6586 | 27.1086/0.6987 | 25.5571/0.6898 | 27.7540/0.8681 | 25.7433/0.7288 |
| TOF [7] | 22.4371/0.7242 | 26.6647/0.7356 | 25.3451/0.707 | 28.9459/0.8799 | 25.8482/0.7617 |
| EGVSR [20] | 23.5585/0.7959 | 27.4242/0.8009 | 24.7348/0.7091 | 27.9476/0.8599 | 25.9163/0.7915 |
| SOFVSR [30] | 22.7644/0.7463 | 26.8175/0.7495 | 25.5315/0.7182 | 29.1136/0.8835 | 26.0568/0.7744 |
| RISTN [42] | 22.9171/0.7504 | 26.9975/0.7582 | 25.5761/0.7221 | 29.2238/0.8814 | 26.1786/0.7780 |
| Ours | 23.0747/0.7626 | 27.0412/0.7642 | 25.6840/0.7255 | 29.6298/0.8937 | 26.3574/0.7865 |
| Ours (RAI) | 23.0682/0.7621 | 27.0244/0.7627 | 25.6730/0.7248 | 29.6147/0.8935 | 26.3451/0.7858 |
Table 3. Quantitative Comparison on the SPMCs-30 Benchmark. Each cell reports PSNR (dB)/SSIM; the best and second-best results are marked in red and blue, respectively.

| Video | IMDN [25] | SWRN [19] | TOF [7] | EGVSR [20] | SOFVSR [30] | RISTN [42] | Ours | Ours (RAI) |
|---|---|---|---|---|---|---|---|---|
| AMVTG_004 | 26.0314/0.7179 | 24.9373/0.6398 | 24.8697/0.6442 | 22.7946/0.5583 | 25.1903/0.6643 | 25.6177/0.6820 | 26.4554/0.7456 | 26.4566/0.7456 |
| HKVTG_004 | 28.5451/0.7519 | 28.2554/0.7391 | 28.4698/0.7483 | 25.3357/0.6234 | 28.6597/0.7596 | 28.6793/0.7597 | 28.8824/0.7694 | 28.8816/0.7693 |
| LDVTG_009 | 26.8267/0.8353 | 25.7857/0.8052 | 26.4415/0.8345 | 27.1797/0.8551 | 27.0993/0.8501 | 27.6783/0.8588 | 27.7354/0.8653 | 27.7327/0.8653 |
| LDVTG_022 | 29.7429/0.8496 | 29.2831/0.8352 | 29.3073/0.8392 | 27.1004/0.7822 | 29.4682/0.8432 | 29.8795/0.8502 | 30.1291/0.8604 | 30.1330/0.8605 |
| NYVTG_006 | 29.4499/0.8564 | 29.2991/0.8452 | 30.2000/0.8599 | 25.9653/0.7647 | 30.9459/0.8785 | 30.6516/0.8723 | 30.9080/0.8819 | 30.9066/0.8818 |
| PRVTG_008 | 23.8790/0.6764 | 23.4167/0.6534 | 23.8181/0.6741 | 21.1342/0.5533 | 24.1036/0.6959 | 24.2942/0.7067 | 24.5739/0.7167 | 24.5743/0.7166 |
| PRVTG_012 | 26.4371/0.7718 | 26.2694/0.7628 | 26.5504/0.7750 | 24.5646/0.7129 | 26.7048/0.7839 | 26.8525/0.7899 | 26.9413/0.7945 | 26.9435/0.7945 |
| RMVTG_011 | 25.8581/0.7458 | 25.4017/0.7254 | 25.9722/0.7488 | 23.8420/0.6703 | 26.2772/0.7642 | 26.5048/0.7711 | 26.6950/0.7797 | 26.6929/0.7795 |
| RMVTG_024 | 25.2832/0.6664 | 24.9563/0.6488 | 25.3642/0.6720 | 23.5458/0.6162 | 25.7016/0.6972 | 25.8718/0.7103 | 25.9665/0.7109 | 25.9703/0.7109 |
| TPVTG_003 | 30.3131/0.8815 | 29.7610/0.8687 | 29.9361/0.8725 | 27.3018/0.7860 | 29.9714/0.8753 | 30.3205/0.8773 | 30.7238/0.8898 | 30.7237/0.8898 |
| cact1_001 | 32.2501/0.9075 | 31.1136/0.8904 | 32.1522/0.9196 | 31.3228/0.9111 | 32.3432/0.9214 | 32.5230/0.9178 | 33.3811/0.9284 | 33.3668/0.9281 |
| car05_001 | 29.5382/0.8423 | 29.1795/0.8338 | 30.0287/0.8620 | 28.9618/0.8309 | 30.1839/0.8637 | 29.8935/0.8423 | 30.0922/0.8537 | 30.0779/0.8532 |
| gree3_001 | 29.9863/0.8119 | 29.6108/0.8030 | 29.6615/0.8102 | 26.9135/0.7017 | 29.9342/0.8180 | 30.1444/0.8163 | 30.3567/0.8249 | 30.3561/0.8248 |
| hdclub_001 | 23.8864/0.7387 | 23.4446/0.7148 | 23.8494/0.7484 | 22.8246/0.7338 | 24.0650/0.7603 | 24.5185/0.7756 | 24.8958/0.7869 | 24.9024/0.7870 |
| hdclub_003 | 20.3518/0.6015 | 20.1453/0.5881 | 20.8639/0.6535 | 19.3091/0.6207 | 21.0255/0.6679 | 21.2868/0.6896 | 21.1774/0.6751 | 21.1785/0.6752 |
| hdclub_008 | 25.9392/0.7184 | 25.7197/0.7045 | 26.1188/0.7312 | 24.6306/0.6647 | 26.2198/0.7369 | 26.2789/0.7381 | 26.4213/0.7473 | 26.4210/0.7473 |
| hitachi_isee5 | 23.1258/0.8079 | 21.9516/0.7568 | 22.9783/0.8055 | 24.6103/0.8655 | 23.3418/0.8187 | 24.1477/0.8358 | 24.3415/0.8470 | 24.3261/0.8464 |
| hk001_001 | 29.6545/0.7580 | 29.1030/0.7459 | 29.5680/0.7758 | 26.7095/0.6702 | 29.8456/0.7845 | 30.1093/0.7868 | 30.4816/0.7949 | 30.4851/0.7948 |
| hk004_006 | 30.8958/0.8463 | 30.1270/0.8324 | 30.5283/0.8559 | 28.1557/0.8003 | 31.0959/0.8634 | 31.1093/0.8616 | 31.7502/0.8719 | 31.7509/0.8719 |
| indi1_004 | 33.0746/0.8891 | 32.2322/0.8709 | 32.8443/0.8913 | 31.7546/0.8870 | 33.0945/0.8976 | 33.4130/0.8988 | 34.2882/0.9145 | 34.2895/0.9145 |
| indi1_032 | 34.6416/0.9246 | 33.4208/0.9077 | 34.4735/0.9315 | 33.4178/0.9201 | 34.5877/0.9352 | 34.8725/0.9303 | 36.2915/0.9469 | 36.2836/0.9467 |
| jvc_004 | 29.8523/0.9494 | 28.7227/0.9329 | 30.0097/0.9513 | 31.6536/0.9632 | 29.9930/0.9502 | 30.6357/0.9531 | 31.0591/0.9612 | 31.0441/0.9611 |
| jvc_009 | 27.7313/0.8500 | 27.1172/0.8263 | 27.7947/0.8517 | 26.8302/0.8462 | 28.0539/0.8606 | 28.0865/0.8571 | 28.6781/0.8758 | 28.6749/0.8756 |
| land5_001 | 37.0995/0.9602 | 35.7070/0.9548 | 36.0325/0.9637 | 34.8101/0.9573 | 35.9214/0.9647 | 36.7371/0.9578 | 38.2512/0.9681 | 38.2421/0.9680 |
| land9_007 | 34.9431/0.9161 | 33.9314/0.9094 | 34.4512/0.9288 | 32.2815/0.8839 | 34.7169/0.9297 | 35.0958/0.9264 | 36.1717/0.9360 | 36.1766/0.9361 |
| philips_hkc01 | 35.6114/0.9373 | 33.9995/0.9104 | 34.2302/0.9125 | 34.6706/0.9254 | 34.9984/0.9274 | 34.7056/0.9193 | 36.2910/0.9429 | 36.2628/0.9426 |
| philips_hkc04 | 34.2978/0.8927 | 33.5183/0.8797 | 34.1432/0.8917 | 31.0672/0.8267 | 32.6543/0.8712 | 32.8824/0.8624 | 34.2802/0.8879 | 34.2844/0.8880 |
| philips_hkc05 | 30.3882/0.8448 | 29.3879/0.8052 | 30.8576/0.8576 | 32.1182/0.9017 | 30.8332/0.8587 | 30.7311/0.8541 | 31.2771/0.8711 | 31.2218/0.8692 |
| philips_hkc11 | 36.4101/0.8961 | 35.2255/0.8713 | 35.7448/0.8846 | 32.6174/0.8316 | 35.6057/0.8850 | 35.5644/0.8810 | 36.8548/0.9042 | 36.8287/0.9036 |
| veni3_011 | 33.1165/0.9562 | 32.2105/0.9444 | 32.7536/0.9527 | 29.8624/0.9047 | 33.1524/0.9571 | 33.1738/0.9524 | 34.5723/0.9645 | 34.5763/0.9646 |
| Average | 29.5054/0.8267 | 28.7745/0.8069 | 29.3338/0.8283 | 27.7762/0.7856 | 29.5263/0.8361 | 29.7505/0.8378 | 30.3308/0.8506 | 30.3255/0.8504 |
Table 4. Quantitative Comparison of Efficiency for Producing 720P Frames.

| Method | Parameters | FLOPs | Latency (ms) | Real-Time Inference | PSNR on Vid4 | PSNR on SPMCs-30 |
|---|---|---|---|---|---|---|
| IMDN [25] | 715K | 40.91G | 12.50 | Yes | 25.2947 | 29.5054 |
| SWRN [19] | 43K | 5.00G | 9.50 | Yes | 25.0128 | 28.7744 |
| TOF [7] | 1405K | 133.06G | 545.77 | No | 25.8482 | 29.3338 |
| EGVSR [20] | 2587K | 102.89G | 14.17 | Yes | 25.9163 | 27.7762 |
| SOFVSR [30] | 1048K | 120.83G | 128.36 | No | 26.0568 | 29.5263 |
| Ours | 1895K | 336.40G | 101.47 | No | 26.3574 | 30.3308 |
| Ours (RAI) | 1895K | 109.07G | 39.51 | Yes | 26.3451 | 30.3255 |
Table 5. Quantitative Performance for the Ablation Study.

| Model | Spatial Aggregation | Temporal Aggregation | Pre-Trained Parameters | Vid4 PSNR | Vid4 SSIM | SPMCs-30 PSNR | SPMCs-30 SSIM |
|---|---|---|---|---|---|---|---|
| Model 1 | No | No | No | 25.3254 | 0.7249 | 29.3972 | 0.8216 |
| Model 2 | Yes | No | No | 25.9753 | 0.7645 | 29.6212 | 0.8317 |
| Model 3 | Yes | Yes | No | 26.2903 | 0.7808 | 30.0463 | 0.8429 |
| Model 4 | No | No | Yes | 25.4421 | 0.7318 | 29.6574 | 0.8288 |
| Model 5 | Yes | Yes | Yes | 26.3574 | 0.7865 | 30.3308 | 0.8506 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
