4.2. Implementation Details
To generate LR frames, bicubic degradation is applied via the Matlab function imresize, with the downsampling scale factor set to four. During the training phase, the ground-truth (GT) patch size and the mini-batch size were empirically set to 256 and 16, respectively. To capture temporal information, the number of neighboring frames on each side is empirically set to two, so the model takes five LR frames as input. Additionally, data augmentation techniques, such as random flipping and rotation, were applied to the training data. The Adam optimizer [39], with momentum parameters $\beta_1$ and $\beta_2$, is utilized to optimize the proposed method. The learning rate was initialized to a fixed value and gradually decayed over the course of training. The training process lasted for 300,000 iterations. The channel number of the proposed model is empirically set to 64, except for the cases shown in Table 1. All experiments were conducted on a server with Python 3.8, PyTorch 1.12, an Intel CPU, and an Nvidia RTX 2080 Ti GPU.
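For concreteness, the degradation and augmentation pipeline can be sketched as follows. This is a minimal illustration using PIL's bicubic resampling as a stand-in for Matlab's imresize (the two kernels differ slightly); the constants mirror the settings stated above, while the function names are ours.

```python
import random
from PIL import Image

SCALE = 4        # bicubic downsampling factor
GT_PATCH = 256   # ground-truth patch size
BATCH_SIZE = 16  # mini-batch size
N_NEIGHBORS = 2  # neighbors on each side -> 2 + 1 + 2 = 5 LR input frames

def make_lr(hr: Image.Image) -> Image.Image:
    """Bicubic x4 degradation (PIL stand-in for Matlab's imresize)."""
    w, h = hr.size
    return hr.resize((w // SCALE, h // SCALE), Image.BICUBIC)

def augment(frames: list) -> list:
    """Random flipping and rotation, applied identically to all frames
    of one training sample so that temporal alignment is preserved."""
    if random.random() < 0.5:
        frames = [f.transpose(Image.FLIP_LEFT_RIGHT) for f in frames]
    if random.random() < 0.5:
        frames = [f.transpose(Image.FLIP_TOP_BOTTOM) for f in frames]
    angle = random.choice([0, 90, 180, 270])
    if angle:
        frames = [f.rotate(angle, expand=True) for f in frames]
    return frames
```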
To initialize the weights of the proposed method, the spatial feature extraction module and the upsampler module load the weights of the pre-trained foundational framework, IMDN [25]. The remaining parameters are initialized by PyTorch's defaults, and no parameters are frozen when training the proposed method. The training of IMDN is consistent with [25]: the training set is DIV2K [40], bicubic degradation is adopted to generate LR images, the channel number is set to 64, and the batch size is 16.
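The initialization scheme can be sketched as follows. The class structure, layer choices, and checkpoint path below are hypothetical placeholders; only the strategy itself, reusing pre-trained IMDN weights for the spatial feature extraction and upsampler modules while training all parameters end to end, follows the description above.

```python
import torch
import torch.nn as nn

class VSRModel(nn.Module):
    """Stand-in structure: a spatial feature extractor and an upsampler
    (initialized from pre-trained IMDN) plus a temporal aggregation module
    (initialized by PyTorch's defaults)."""
    def __init__(self, channels: int = 64, n_frames: int = 5):
        super().__init__()
        self.spatial_extractor = nn.Conv2d(3, channels, 3, padding=1)
        self.temporal_agg = nn.Conv2d(n_frames * channels, channels, 3, padding=1)
        self.upsampler = nn.Sequential(
            nn.Conv2d(channels, 3 * 4 ** 2, 3, padding=1),  # x4 upscaling
            nn.PixelShuffle(4),
        )

model = VSRModel()

# Load whatever keys match from a pre-trained IMDN checkpoint (path
# hypothetical); non-matching modules keep their default initialization.
imdn_state = torch.load("imdn_x4.pth", map_location="cpu")
model.load_state_dict(imdn_state, strict=False)

# No parameters are frozen: the whole model is trained end to end.
assert all(p.requires_grad for p in model.parameters())
```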
The performance of the reconstructed frames is assessed by two widely adopted metrics: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [41].
The PSNR of one SR frame is defined as:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right),$$

and the mean squared error (MSE) is defined as:

$$\mathrm{MSE} = \frac{1}{P}\sum_{i=1}^{P}\left(I_{SR}(i) - I_{HR}(i)\right)^{2},$$

where $P$ represents the total number of pixels in a frame, and $I_{SR}$ and $I_{HR}$ denote the SR frame result and the HR frame reference, respectively. Further, SSIM is defined as:

$$\mathrm{SSIM} = \frac{\left(2\mu_{SR}\mu_{HR} + c_1\right)\left(2\sigma_{SR,HR} + c_2\right)}{\left(\mu_{SR}^2 + \mu_{HR}^2 + c_1\right)\left(\sigma_{SR}^2 + \sigma_{HR}^2 + c_2\right)},$$

where $\mu_{SR}$ and $\mu_{HR}$ are the mean values of the SR and HR frames, respectively, and $\sigma_{SR}$ and $\sigma_{HR}$ are their standard deviations. The constants $c_1$ and $c_2$ are used to stabilize the calculation and are set to $(0.01 \times 255)^2$ and $(0.03 \times 255)^2$, respectively [41]. The covariance of the SR and HR frames is denoted as $\sigma_{SR,HR}$.
Following previous studies [7,19,20,33], these metrics are calculated on the luminance channel (the Y channel of the YCbCr color space), with eight pixels cropped at each boundary. Note that all frames were considered for performance evaluation.
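A minimal sketch of this evaluation protocol is given below, assuming the common ITU-R BT.601 RGB-to-Y conversion; SSIM can be computed analogously on the same cropped Y channels (e.g., with skimage.metrics.structural_similarity).

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """RGB in [0, 255] -> luminance (Y) channel of YCbCr (ITU-R BT.601)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, border: int = 8) -> float:
    """PSNR between an SR frame and its HR reference, computed on the
    Y channel with `border` pixels cropped at each boundary."""
    sr_y = rgb_to_y(sr.astype(np.float64))[border:-border, border:-border]
    hr_y = rgb_to_y(hr.astype(np.float64))[border:-border, border:-border]
    mse = np.mean((sr_y - hr_y) ** 2)          # MSE over the P cropped pixels
    return 10.0 * np.log10(255.0 ** 2 / mse)   # PSNR as defined above
```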
4.3. Comparisons
To examine the performance of our model, comparisons are conducted with one image SR method (IMDN [25]) and six video SR methods (SWRN [19], 3DSRnet [31], TOF [7], EGVSR [20], SOFVSR [30], and RISTN [42]). IMDN [25] is a lightweight image SR model and is employed as the foundational framework of the proposed method. SWRN [19] is a recent lightweight video SR method. 3DSRnet [31] exploits spatial-temporal information via 3D convolution. TOF [7] focuses on estimating task-specific optical flow in videos. EGVSR [20] is a generative adversarial network (GAN)-based model, and SOFVSR [30] predicts HR optical flow to enhance video SR results. RISTN [42] leverages temporal features in a recurrent scheme.
First, the proposed method is evaluated on the Vid4 benchmark. The quantitative results are presented in Table 2 and Figure 6a; in each cell, the first row is the PSNR value and the second row is the SSIM value. The results on the Vid4 benchmark demonstrate that our method outperforms the others in terms of overall performance. Compared with the foundational IMDN [25], the proposed method improves PSNR and SSIM by 1.06 dB and 0.057, respectively. The proposed method also surpasses the lightweight VSR method SWRN [19], leading by 1.34 dB in PSNR.
In addition, the proposed method is superior to TOF [7] and SOFVSR [30], which are VSR methods based on optical flow, and the recurrent-based RISTN [42] also falls below the proposed approach. When compared with the GAN-based EGVSR [20], the proposed method underperforms on the Calendar and City videos but outperforms it on the Foliage and Walk videos. On average, the PSNR value of the proposed method is 0.44 dB higher than that of EGVSR [20], while the SSIM value is 0.005 lower. Thus, the proposed method demonstrates better overall performance, owing to its use of an image SR model, which excels at exploiting spatial information, while the proposed fast temporal information aggregation module effectively leverages information from neighboring frames. Importantly, the inclusion of the proposed RAI barely affected performance, with only marginal degradation of 0.0093 dB in PSNR and 0.0007 in SSIM.
For a qualitative comparison, the proposed method is compared with IMDN [25], SWRN [19], TOF [7], and SOFVSR [30]. As shown in Figure 7, frames from each video are presented, arranged from the top row to the bottom as follows: Calendar, City, Foliage, and Walk. The first column shows the whole frame, the second column, labeled GT, is the reference for the compared patch, and the third through seventh columns show the results of the different methods, each marked with its PSNR. Notably, the proposed model delivers superior performance in enhancing text clarity in Calendar and sharpening the car's boundaries in Foliage. This can be attributed to our model's use of an image SR model as its foundational framework, which gives it the capacity to effectively extract and utilize spatial information. Additionally, the proposed method performs well at reconstructing clear building textures in City, and in Walk, the rope on the clothes is noticeably more recognizable. In both of these scenarios, the aggregation of temporal information plays an important role in achieving the improved results.
In addition to the Vid4 benchmark, comparisons are conducted on the SPMCs-30 [14] benchmark. The quantitative results are presented in Table 3 and Figure 6b. On the SPMCs-30 benchmark, the proposed method surpasses all others in terms of average PSNR and SSIM. Specifically, our method achieves a remarkable improvement of 1.5 dB and 4.3% over SWRN [19] in average PSNR and SSIM, respectively. Compared with the optical flow-based methods TOF [7] and SOFVSR [30], the proposed method leads by a margin of 0.8 dB in PSNR. Further, the recurrent-based RISTN [42] underperforms the proposed method by 0.58 dB and 0.012 in PSNR and SSIM, respectively, indicating that the proposed method makes better use of neighboring information than the recurrent scheme in RISTN [42].
The qualitative comparison is shown in Figure 8, where frames from six videos have been selected for analysis. Arranged from the top row to the bottom, the videos are: AMVTG_004, hdclub_001, hdclub_003, hitachi_isee5, jvc_004, and LDVTG_009; the GT column is the high-resolution reference. In AMVTG_004, it is evident that all compared models struggle to accurately reproduce the texture of the wall, and some methods introduce undesired artifacts. Similarly, in hdclub_001, only the proposed method and SWRN succeed in recovering the correct structure, by effectively leveraging temporal information from neighboring frames. Although all compared methods perform poorly on hdclub_003, the proposed method still reconstructs a clear and well-defined structure for the building in hdclub_003 and the flower in hitachi_isee5. The results on jvc_004 show the ability of the proposed method to recover more details. Lastly, the SR frames of LDVTG_009 illustrate how the proposed method effectively exploits the capability of the image SR model, leading to improved results. These qualitative comparisons serve as compelling evidence of the superior performance and effectiveness of the proposed method.
The temporal consistency of the proposed model is evaluated following the methodology of a prior study [33]. The temporal profiles of the different methods are shown in Figure 9, with each profile generated at the location marked in red in the first column; the reference temporal profile of the high-resolution video frames is shown in the GT column. As can be seen, the proposed model generates smooth and clearly defined temporal profiles, particularly in Calendar and City. While artifacts are present in the temporal profile of Walk for all methods, the proposed approach exhibits the fewest such artifacts, indicating its ability to effectively preserve temporal consistency. These findings serve as robust evidence of the enhanced temporal performance of our method.
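A temporal profile of this kind is typically obtained by extracting the same scanline from every frame and stacking the lines over time, as in the following sketch (the row index corresponds to the location marked in red):

```python
import numpy as np

def temporal_profile(frames: np.ndarray, row: int) -> np.ndarray:
    """Stack one fixed scanline across all frames of a video.

    frames: (T, H, W, C) array; the returned profile has shape (T, W, C),
    so the vertical axis is time. A temporally consistent method produces
    smooth structures along this axis, while flicker appears as jagged
    artifacts."""
    return frames[:, row, :, :]
```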
4.4. Efficiency
Efficiency is analyzed from four aspects: the number of parameters, the number of computational operations, inference latency, and the quality of the SR results. The floating-point operations (FLOPs) and latency of each model are evaluated by producing 100 SR frames with a resolution of 1280 × 720 (720P). Further, all models are run on an Nvidia RTX 2080 Ti GPU. The efficiency of the proposed method and the compared models is presented in Table 4 and Figure 10. As shown in Table 4, four models are capable of real-time inference. The parameter counts of IMDN [25] and SWRN [19] are relatively small, and their low computational complexity enables real-time inference; however, their PSNR performance is slightly lower than that of the other methods. TOF [7] and SOFVSR [30] need additional time for optical flow estimation, so they cannot achieve real-time inference. EGVSR [20] has more parameters than the proposed method. The proposed method performs well in terms of parameter count and PSNR, but without the RAI it cannot achieve real-time inference due to redundant computation. With the integration of the RAI, both latency and FLOPs drop significantly, allowing the proposed method to produce real-time 720P SR frames while still achieving competitive performance. These results indicate that the RAI is an efficient, simple, yet effective strategy for optimizing the inference process by avoiding unnecessary computations, striking a balance between effectiveness and efficiency. Further, its modular design allows it to be integrated into other video models that require spatio-temporal feature extraction.
4.5. Ablation Analysis
In this section, ablation studies are presented to examine the impact of the key components. IMDN, which takes a single LR frame as input, establishes the baseline for comparison. Subsequently, the spatial aggregation and temporal aggregation, the key stages of the fast temporal information aggregation module, are evaluated. To measure the performance of the model with spatial aggregation only, the spatially aggregated features are fused using concatenation followed by a convolutional layer, as sketched below.
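A sketch of this fusion step follows; the channel count of 64 and the five-frame input mirror the settings above, while the 1 × 1 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuses per-frame spatially aggregated features by channel-wise
    concatenation followed by a single convolutional layer."""
    def __init__(self, n_frames: int = 5, channels: int = 64):
        super().__init__()
        # Kernel size 1 is an assumption; the text only states that a
        # convolutional layer fuses the concatenated features.
        self.fuse = nn.Conv2d(n_frames * channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = feats.shape            # feats: (B, T, C, H, W)
        return self.fuse(feats.reshape(b, t * c, h, w))
```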
Table 5 presents the ablation studies of the proposed model, with the second and third columns highlighting each variation.
On the Vid4 benchmark, the baseline model without temporal information achieves a PSNR of 25.3254 dB and an SSIM of 72.49%. Incorporating spatial aggregation yields a noticeable improvement of 0.6499 dB in PSNR and 3.96% in SSIM; note that the temporal aggregation in this variant is a simple convolution. When the proposed temporal aggregation approach is employed, performance increases further, by an additional 0.315 dB in PSNR and 1.63% in SSIM. These results validate the significant contributions of both the spatial and temporal aggregation components within our method.
Furthermore, an additional analysis is conducted to evaluate the impact of well-trained parameters from the image SR model on the video SR task. As shown in Table 5, the fourth column indicates whether the model was initialized with well-trained image SR parameters. The results demonstrate the significance of utilizing such parameters: Model 4 exhibits superior performance compared to Model 1, and Model 5 outperforms Model 3. These findings suggest that incorporating well-trained parameters from an image SR model can effectively enhance the overall performance of the video SR task, further emphasizing the value of leveraging existing knowledge from the field of image SR to improve the efficiency and effectiveness of video SR models.