Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation
Abstract
1. Introduction
- We propose a novel Transformer-based VFI framework that overcomes the limitations of traditional CNN-based methods and effectively models long-range dependencies between pixels.
- We design a new attention mechanism, PSTA, composed of a temporal attention (TA) branch and a spatial attention (SA) branch that process inter-frame spatio-temporal information in parallel, enabling efficient processing of video frames. TA captures temporal variations of pixels across frames, while SA efficiently aggregates spatial features. SA combines the convolutional and self-attention paradigms, and the input features are transformed by a simple mapping so that they suit both paradigms, which improves the quality and realism of the synthesized frames (a minimal sketch of this block follows this list). We also propose two sub-networks, CE-Net and MPFS-Net, for enhancing the details of synthesized frames and for fusing multi-scale video frame information, respectively.
- Our model achieves strong performance on various benchmark datasets with higher processing efficiency and fewer parameters. As shown in Figure 1, our standard model (Ours) outperforms the state-of-the-art (SOTA) methods ABME [15] and FLAVR [16] by 0.19 dB and 0.07 dB while using only 95.02% and 40.56% of their parameters, respectively.
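To make the parallel TA/SA design above concrete, the following is a minimal PyTorch sketch of a PSTA-style block. It is an illustration under simplifying assumptions, not the authors' released implementation: the class names (`PSTABlock`, `SpatialAttention`, `TemporalAttention`), the residual fusion, and the exact form of the simple mapping are assumptions. Only the overall structure follows the paper: TA attends across frames, and SA mixes a convolutional path with a self-attention path, with the convolutional output scaled by the 0.3 weight mentioned in Section 4.2.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Illustrative SA branch: a shared 'simple mapping' feeds both a
    convolutional path and a self-attention path, whose outputs are mixed."""
    def __init__(self, dim, heads=4, conv_weight=0.3):
        super().__init__()
        self.mapping = nn.Conv2d(dim, dim, 1)          # simple mapping shared by both paradigms
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)  # convolutional paradigm
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # self-attention paradigm
        self.alpha = conv_weight                       # strength of the conv output (0.3 in Section 4.2)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        m = self.mapping(x)
        conv_out = self.conv(m)
        tokens = m.flatten(2).transpose(1, 2)          # (B, H*W, C) tokens for self-attention
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        return attn_out + self.alpha * conv_out        # combine the two paradigms

class TemporalAttention(nn.Module):
    """Illustrative TA branch: attention over the frame (time) axis at each location."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                              # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # time as the sequence axis
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

class PSTABlock(nn.Module):
    """SA and TA run in parallel on the same input and their outputs are summed."""
    def __init__(self, dim):
        super().__init__()
        self.sa = SpatialAttention(dim)
        self.ta = TemporalAttention(dim)

    def forward(self, x):                              # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        sa_out = self.sa(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        ta_out = self.ta(x)
        return x + sa_out + ta_out                     # residual fusion of the parallel branches

feats = torch.randn(1, 2, 32, 32, 32)                  # two input frames, 32 channels, small demo size
print(PSTABlock(32)(feats).shape)                      # torch.Size([1, 2, 32, 32, 32])
```

The key point of the sketch is that the two branches see the same input and are fused additively, so neither branch waits on the other; any windowing or efficiency tricks used in the actual model are omitted here.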
2. Related Work
2.1. Video Frame Interpolation
2.2. Vision Transformer
3. Proposed Method
3.1. Parallel Spatio-Temporal Attention
3.1.1. Spatial Attention
3.1.2. Temporal Attention
3.1.3. Computational Cost
3.2. Context Extraction Network
3.3. Multi-Scale Prediction Frame Synthesis Network
4. Experiment
4.1. Datasets and Metrics
- Vimeo90K [8]: This is a popular dataset widely used in VFI, in which each sample consists of three consecutive video frames. The training set contains 51,312 triplets and the testing set contains 3,782 triplets, all with a resolution of 448 × 256.
- Middlebury [9]: This is a classic visual benchmark that provides a wealth of data on realistic scenes. We choose its OTHER testing set, which provides ground-truth intermediate frames at a resolution of 640 × 480.
- X4K1000FPS [17]: This is a 4K video dataset typically used to evaluate multi-frame interpolation in ultra-high-definition (UHD) scenes. We perform 8× frame interpolation testing on this dataset at both 4K and 2K resolutions by iteratively applying our model to generate the intermediate frames (see the recursion sketch after this list).
- UCF101 [35]: This dataset contains a rich collection of human-action videos and is suitable for video action recognition, video interpolation, and other tasks. The evaluation set consists of 379 triplets, each with a resolution of 256 × 256.
- SNU-FILM [36]: This dataset provides high-quality video sequences and various motion types. It contains 1240 video triplets with a resolution of 1280 × 720. Based on the difficulty of the motion, it is divided into four subsets: easy, medium, hard, and extreme.
- HD [37]: This dataset, collected by Bao et al. [37], contains 11 videos: four at 1080p, three at 720p, and four at 1280 × 544. It is often used to evaluate multi-frame interpolation in high-definition (HD) scenes. We choose the 1080p and 720p videos for testing and perform 4× frame interpolation by iteratively applying our model to generate the intermediate frames.
- Metrics: We use peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [38], and average interpolation error (IE) to evaluate the models. On the Middlebury [9] dataset we report IE, where a lower value indicates better performance. On the Vimeo90K [8], X4K1000FPS [17], UCF101 [35], SNU-FILM [36], and HD [37] datasets we report PSNR and SSIM, where higher values indicate better performance (a metric-computation sketch also follows this list).
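The 8× (X4K1000FPS) and 4× (HD) settings above are obtained by iteratively applying a single ×2 interpolation model. The following is a minimal sketch of that recursion; `interpolate_middle` is a hypothetical stand-in for one forward pass of the model that returns the temporal midpoint of two frames.

```python
def multi_frame_interpolate(frame0, frame1, factor, interpolate_middle):
    """Recursively insert midpoints until the frame rate is upsampled by `factor`.

    `factor` must be a power of two (2, 4, or 8).
    `interpolate_middle(a, b)` is a stand-in for one forward pass of the model.
    Returns the full sequence [frame0, ..., frame1] with factor - 1 new frames.
    """
    if factor == 1:
        return [frame0, frame1]
    mid = interpolate_middle(frame0, frame1)            # one model call -> temporal midpoint
    left = multi_frame_interpolate(frame0, mid, factor // 2, interpolate_middle)
    right = multi_frame_interpolate(mid, frame1, factor // 2, interpolate_middle)
    return left + right[1:]                             # drop the duplicated midpoint

# 8x interpolation inserts 7 intermediate frames (7 model calls) between the two inputs:
frames = multi_frame_interpolate("I0", "I1", 8, lambda a, b: f"mid({a},{b})")
assert len(frames) == 9
```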
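For reference, a compact sketch of how these metrics can be computed with scikit-image (version ≥ 0.19 for the `channel_axis` argument). IE is computed here as the root-mean-square pixel difference, following the usual Middlebury interpolation-error definition; this is a generic evaluation helper, not the authors' evaluation script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """pred, gt: uint8 RGB images of identical shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    ie = np.sqrt(np.mean(diff ** 2))                    # RMS interpolation error
    return psnr, ssim, ie

# toy usage with a slightly perturbed image
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
pred = np.clip(gt + np.random.randint(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(evaluate_pair(pred, gt))
```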
4.2. Implementation Details
- Network Architecture: The encoder has five layers: an embedding layer and four PSTAT layers, with feature channel dimensions of 32, 32, 64, 128, and 256, respectively; each PSTAT layer downsamples by a scale factor of 2. Skip connections link the encoder and the decoder. In SA, the parameter that controls the strength of the convolutional output is set to 0.3 [22]. We introduce two variants of our model, Ours and Ours-small, which are identical except for the channel dimensions: the Ours-small channel dimensions are half of those in the standard Ours model (a schematic encoder sketch follows this list).
- Training Details: We crop each training sample of the Vimeo90K [8] training set into 192 × 192 patches and augment the data with random horizontal and vertical flipping as well as time reversal. We use the Adan [39] optimizer for end-to-end training, with the hyperparameters β1, β2, and β3 set to 0.98, 0.92, and 0.99, respectively. The training batch size is 8, and the learning rate is decayed from its initial value with a cosine annealing schedule over 300 epochs. Our model was trained on an NVIDIA GeForce RTX 3090 (NVIDIA Corporation, Santa Clara, CA, USA) with PyTorch 1.12.0, taking about one week (a training-setup sketch also follows this list).
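For orientation, here is a schematic sketch of the encoder layout described above: an embedding layer plus four PSTAT layers with channels 32-32-64-128-256, each PSTAT layer downsampling by 2, and the multi-scale features kept for the skip connections. The class names and the use of `nn.Identity` as a stand-in for the TRB/PSTA blocks are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class PSTATLayer(nn.Module):
    """Placeholder for one PSTAT layer: stride-2 downsampling followed by the
    TRB/PSTA blocks at that scale (represented here by nn.Identity)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # 2x downsampling
        self.blocks = nn.Identity()                     # stand-in for the 2 TRBs wrapping PSTA blocks

    def forward(self, x):
        return self.blocks(self.down(x))

class Encoder(nn.Module):
    """Embedding layer + four PSTAT layers, channel dimensions 32-32-64-128-256."""
    def __init__(self, channels=(32, 32, 64, 128, 256)):
        super().__init__()
        self.embed = nn.Conv2d(3, channels[0], kernel_size=3, padding=1)           # embedding layer
        self.layers = nn.ModuleList(
            PSTATLayer(channels[i], channels[i + 1]) for i in range(len(channels) - 1)
        )

    def forward(self, frame):                           # frame: (B, 3, H, W); per-frame features for simplicity
        x = self.embed(frame)
        skips = [x]                                     # multi-scale features reused by the decoder via skip connections
        for layer in self.layers:
            x = layer(x)
            skips.append(x)
        return skips

feats = Encoder()(torch.randn(1, 3, 256, 256))
print([f.shape[1:] for f in feats])                     # channels 32, 32, 64, 128, 256 at strides 1, 2, 4, 8, 16
```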
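Below is a sketch of the training configuration in PyTorch terms, under stated substitutions: Adan [39] is not bundled with PyTorch, so `AdamW` stands in for it here, and the learning-rate values are placeholders because the exact values are not reproduced above. Only the 192 × 192 crops, the flip/time-reversal augmentation, the batch size of 8, the 300 epochs, and the cosine annealing schedule follow the paper.

```python
import random
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def augment(frame0, gt, frame1):
    """Random horizontal/vertical flips and time reversal on 192 x 192 crops (C, H, W tensors)."""
    if random.random() < 0.5:                           # horizontal flip
        frame0, gt, frame1 = (t.flip([-1]) for t in (frame0, gt, frame1))
    if random.random() < 0.5:                           # vertical flip
        frame0, gt, frame1 = (t.flip([-2]) for t in (frame0, gt, frame1))
    if random.random() < 0.5:                           # time reversal: swap the two input frames
        frame0, frame1 = frame1, frame0
    return frame0, gt, frame1

f0, gt, f1 = (torch.rand(3, 192, 192) for _ in range(3))
f0, gt, f1 = augment(f0, gt, f1)

model = torch.nn.Conv2d(6, 3, 3, padding=1)             # stand-in for the full interpolation network
# AdamW stands in for Adan [39]; the initial learning rate here is a placeholder value.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)  # cosine annealing over 300 epochs

for epoch in range(300):
    frames = torch.rand(8, 6, 192, 192)                 # stand-in batch: two stacked input frames, batch size 8
    target = torch.rand(8, 3, 192, 192)                 # ground-truth middle frame
    loss = torch.nn.functional.l1_loss(model(frames), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```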
4.3. Comparisons with State-of-the-Art Methods
4.3.1. Quantitative Comparison
4.3.2. Qualitative Comparison
4.4. Ablation Study
- Ablation Study on Model Layer Structure and Channel Selection. We investigate the effects of PSTAT's layer structure and initial channel dimension on model performance, as shown in Table 4. Specifically, we set the number of TRBs in each PSTAT layer to 1 or 2, and pair each structure with an initial channel dimension of 16 or 32. Based on the results in Table 4, our model structure is scalable, and the model performs best when the number of TRBs is 2 and the initial channel dimension is 32 (the variant configurations are summarized in the sketch after this list). Furthermore, as the number of TRBs decreases, the performance of the model decreases, which indicates the effectiveness of TRBs for the interpolation task.
- Ablation Study on PSTAT Structure and TRB Structure. To more thoroughly verify the effectiveness and scalability of PSTA, we replace PSTA in the PSTAT layer with a regular convolutional layer; the results are shown in Table 5. For the same number of TRBs, the PSNR of the TRB with regular convolution is consistently lower than that of the TRB with PSTA. Comparing TRB = 2 with convolution against TRB = 1 with one PSTA block, the PSNR of the former is still lower than that of the latter. These results indicate that PSTA is superior to convolution and better suited to modeling inter-frame motion. Additionally, as the number of PSTA blocks increases, model performance improves, further demonstrating our model's scalability.
- Ablation Study on PSTA. We perform an ablation study on the PSTA structural design to analyze the effect of SA and TA on model performance; the results are shown in Table 5. We first evaluate the individual impact of SA and TA. In particular, the PSNR of the model is 35.61 dB when a convolutional layer replaces SA, and 36.19 dB when TA is removed. Performance thus degrades when either SA or TA is missing, with SA having the greater impact. This indicates that our proposed PSTA is very effective, allowing our model to aggregate both inter-frame and intra-frame information without increasing the computational overhead, and it also shows the advantage of the parallel mechanism. In addition, we remove the self-attention mechanism and the convolution from SA, respectively; both affect model performance, and removing the self-attention mechanism has a greater impact than removing the convolution. This result suggests that self-attention is better suited than convolution for modeling large motions in the VFI task, and it indirectly shows that the Transformer-based model proposed in this paper outperforms its CNN-based counterpart.
- Ablation Study on Model Architecture Design. For CE-Net and MPFS-Net, we conduct a simple comparison experiment with three model structures: the model without CE-Net, the model without MPFS-Net, and the model without both CE-Net and MPFS-Net. As shown in Table 5, the model performs poorly when CE-Net or MPFS-Net is missing, and especially poorly when both are missing. These results show that CE-Net and MPFS-Net are beneficial to our model and allow the Transformer-based structure to realize its full potential; they also enable our model to learn multi-scale information and synthesize high-quality video frames.
- Ablation Study on the Conv Scheme in CE-Net. To investigate the strategy of using 1 × 1 convolutions in CE-Net, we use a simple model structure (1 TRB + 1 PSTA block) and replace the 1 × 1 kernels with 3 × 3 kernels, evaluating CE-Net with both kernel sizes. Comparing the last two rows of Table 5 shows that the larger convolution kernel performs worse in our model and fails to focus on finer details. In contrast, the 1 × 1 convolution aggregates more contextual information and is better suited to CE-Net.
- Visual Ablation Study. Besides the quantitative ablation comparisons, we also perform qualitative comparisons of PSTA, CE-Net, and MPFS-Net on the Middlebury [9] dataset, as shown in Figure 6. In particular, the model without SA generates noticeably blurrier intermediate frames. In the second row, all models generate tennis balls with sharp edges, which indicates that the Transformer structure can robustly model long-range pixel similarity for objects with regular shapes. Furthermore, the full model generates sharper and more detailed intermediate frames than the versions lacking key modules, which clearly shows the contribution of each module to the quality of frame synthesis. However, in the second row, there is still room for improvement for all models on human body motions, especially finger joints, which will be the focus of our future research.
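To relate the Table 4 variants above to the architecture description in Section 4.2, the small configuration sketch below spells out how each variant is specified. The dictionary layout and helper function are illustrative assumptions; the TRB-per-layer tuples come from Table 4, the channel progression from the implementation details, and the correspondence between the 2-TRB variants and Ours-small/Ours is inferred from their matching scores.

```python
# Hypothetical configuration table for the Table 4 variants: the tuple gives the
# number of TRBs in each of the four PSTAT layers, `base` the initial channel dimension.
VARIANTS = {
    "1-1-1-1 / 16": {"trbs_per_layer": (1, 1, 1, 1), "base": 16},
    "2-2-2-2 / 16": {"trbs_per_layer": (2, 2, 2, 2), "base": 16},   # matches the Ours-small scores
    "1-1-1-1 / 32": {"trbs_per_layer": (1, 1, 1, 1), "base": 32},
    "2-2-2-2 / 32": {"trbs_per_layer": (2, 2, 2, 2), "base": 32},   # matches the Ours scores (best in Table 4)
}

def channels(base):
    """Channel progression of the five encoder levels given the initial channel dimension."""
    return (base, base, 2 * base, 4 * base, 8 * base)

print(channels(32))   # (32, 32, 64, 128, 256), as in Section 4.2
print(channels(16))   # (16, 16, 32, 64, 128) for the half-channel variants
```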
5. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wu, C.Y.; Singhal, N.; Krahenbuhl, P. Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 416–431. [Google Scholar]
- Kim, S.Y.; Oh, J.; Kim, M. Fisr: Deep joint frame interpolation and super-resolution with a multi-scale temporal loss. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11278–11286. [Google Scholar]
- Haris, M.; Shakhnarovich, G.; Ukita, N. Space-time-aware multi-resolution video enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2859–2868. [Google Scholar]
- Jiang, H.; Sun, D.; Jampani, V.; Yang, M.H.; Learned-Miller, E.; Kautz, J. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9000–9008. [Google Scholar]
- Bao, W.; Lai, W.S.; Ma, C.; Zhang, X.; Gao, Z.; Yang, M.H. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3703–3712. [Google Scholar]
- Huang, Z.; Zhang, T.; Heng, W.; Shi, B.; Zhou, S. Real-time intermediate flow estimation for video frame interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 624–642. [Google Scholar]
- Lee, H.; Kim, T.; Chung, T.Y.; Pak, D.; Ban, Y.; Lee, S. AdaCoF: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5315–5324. [Google Scholar]
- Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2018, 127, 1106–1125. [Google Scholar] [CrossRef]
- Baker, S.; Scharstein, D.; Lewis, J.; Roth, S.; Black, M.J.; Szeliski, R. A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 2007, 92, 1–31. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10012–10022. [Google Scholar]
- Im, S.K.; Chan, K.H. Distributed Spatial Transformer for Object Tracking in Multi-Camera. In Proceedings of the 2023 25th International Conference on Advanced Communication Technology (ICACT), Pyeongchang, Republic of Korea, 19–22 February 2023; pp. 122–125. [Google Scholar]
- Thawakar, O.; Narayan, S.; Cao, J.; Cholakkal, H.; Anwer, R.M.; Khan, M.H.; Khan, S.; Felsberg, M.; Khan, F.S. Video instance segmentation via multi-scale spatio-temporal split attention transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 666–681. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Lu, L.; Wu, R.; Lin, H.; Lu, J.; Jia, J. Video frame interpolation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3532–3542. [Google Scholar]
- Park, J.; Lee, C.; Kim, C.S. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 14539–14548. [Google Scholar]
- Kalluri, T.; Pathak, D.; Chandraker, M.; Tran, D. Flavr: Flow-agnostic video representations for fast frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Vancouver, BC, Canada, 18–22 June 2023; pp. 2071–2082. [Google Scholar]
- Sim, H.; Oh, J.; Kim, M. Xvfi: Extreme video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14489–14498. [Google Scholar]
- Niklaus, S.; Liu, F. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5437–5446. [Google Scholar]
- Ding, T.; Liang, L.; Zhu, Z.; Zharkov, I. Cdfi: Compression-driven network design for frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8001–8011. [Google Scholar]
- Zhang, D.; Huang, P.; Ding, X.; Li, F.; Zhu, W.; Song, Y.; Yang, G. L2BEC2: Local Lightweight Bidirectional Encoding and Channel Attention Cascade for Video Frame Interpolation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–19. [Google Scholar]
- Ding, X.; Huang, P.; Zhang, D.; Liang, W.; Li, F.; Yang, G.; Liao, X.; Li, Y. MSEConv: A Unified Warping Framework for Video Frame Interpolation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024. [Google Scholar] [CrossRef]
- Ning, X.; Li, Y.; Feng, Z.; Liu, J.; Ding, Y. An Efficient Multi-Scale Attention Feature Fusion Network for 4k Video Frame Interpolation. Electronics 2024, 13, 1037. [Google Scholar] [CrossRef]
- Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017; pp. 261–270. [Google Scholar]
- Cheng, X.; Chen, Z. Video frame interpolation via deformable separable convolution. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10607–10614. [Google Scholar]
- Cheng, X.; Chen, Z. Multiple video frame interpolation via enhanced deformable separable convolution. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7029–7045. [Google Scholar] [CrossRef] [PubMed]
- Im, S.K.; Chan, K.H. Local feature-based video captioning with multiple classifier and CARU-attention. IET Image Proc. 2024; early view. [Google Scholar] [CrossRef]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
- Shi, Z.; Xu, X.; Liu, X.; Chen, J.; Yang, M.H. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17482–17491. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Fourure, D.; Emonet, R.; Fromont, E.; Muselet, D.; Tremeau, A.; Wolf, C. Residual conv-deconv grid network for semantic segmentation. arXiv 2017, arXiv:1707.07958. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Choi, M.; Kim, H.; Han, B.; Xu, N.; Lee, K.M. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10663–10671. [Google Scholar]
- Bao, W.; Lai, W.S.; Zhang, X.; Gao, Z.; Yang, M.H. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 933–948. [Google Scholar] [CrossRef] [PubMed]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
- Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. arXiv 2022, arXiv:2208.06677. [Google Scholar]
- Park, J.; Ko, K.; Lee, C.; Kim, C.S. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 109–125. [Google Scholar]
- Kong, L.; Jiang, B.; Luo, D.; Chu, W.; Huang, X.; Tai, Y.; Wang, C.; Yang, J. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1969–1978. [Google Scholar]
- Jin, X.; Wu, L.; Chen, J.; Chen, Y.; Koo, J.; Hahm, C.h. A unified pyramid recurrent network for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1578–1587. [Google Scholar]
- Jin, X.; Wu, L.; Shen, G.; Chen, Y.; Chen, J.; Koo, J.; Hahm, C.h. Enhanced bi-directional motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5049–5057. [Google Scholar]
- Hu, P.; Niklaus, S.; Sclaroff, S.; Saenko, K. Many-to-many splatting for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3553–3562. [Google Scholar]
- Zhang, G.; Zhu, Y.; Wang, H.; Chen, Y.; Wu, G.; Wang, L. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5682–5692. [Google Scholar]
| Module | Step | FLOPs |
|---|---|---|
| SA | Simple Mapping | |
| SA | Convolution + Attention | |
| TA | All | |
| Methods | Vimeo90K | UCF101 | M.B. (IE) | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | #P (M) | #R (ms) |
|---|---|---|---|---|---|---|---|---|---|
| DAIN [5] | 34.71/0.9756 | 34.99/0.9683 | 2.04 | 39.73/0.9902 | 35.46/0.9780 | 30.17/0.9335 | 25.09/0.8584 | 24 | 151 |
| RIFE [6] | 35.61/0.9780 | 35.28/0.9690 | 1.96 | 40.06/0.9907 | 35.75/0.9789 | 30.10/0.9330 | 24.84/0.8534 | 9.8 | 12 |
| RIFE-Large [6] | 36.10/0.9801 | 35.29/0.9693 | 1.94 | 40.02/0.9906 | 35.92/0.9791 | 30.49/0.9364 | 25.24/0.8621 | 9.8 | 80 |
| AdaCoF [7] | 34.47/0.9730 | 34.90/0.9680 | 2.31 | 39.80/0.9900 | 35.05/0.9754 | 29.46/0.9244 | 24.31/0.8439 | 22.9 | 30 |
| ToFlow [8] | 33.73/0.9682 | 34.58/0.9667 | 2.15 | 39.08/0.9890 | 34.39/0.9740 | 28.44/0.9180 | 23.39/0.8310 | 1.1 | 84 |
| ABME [15] | 36.18/0.9805 | 35.38/0.9698 | 2.01 | 39.59/0.9901 | 35.77/0.9789 | 30.58/0.9364 | 25.42/0.8639 | 18.1 | 277 |
| XVFIv [17] | 35.07/0.9681 | 35.18/0.9519 | - | 39.78/0.9840 | 35.37/0.9641 | 29.91/0.8935 | 24.73/0.7782 | 5.5 | 98 |
| SoftSplat [18] | 36.10/0.9700 | 35.39/0.9520 | 1.81 | - | - | - | - | - | - |
| CDFI [19] | 35.17/0.9640 | 35.21/0.9500 | 1.98 | 40.12/0.9906 | 35.51/0.9778 | 29.73/0.9277 | 24.53/0.8476 | 5 | 172 |
| SepConv [23] | 33.79/0.9702 | 34.78/0.9669 | 2.27 | 39.41/0.9900 | 34.97/0.9762 | 29.36/0.9253 | 24.31/0.8448 | 21.6 | 200 |
| EDSC [25] | 34.84/0.9750 | 35.13/0.9680 | 2.02 | 40.01/0.9900 | 35.37/0.9780 | 29.59/0.9260 | 24.39/0.8430 | 8.9 | 46 |
| CAIN [36] | 34.65/0.9730 | 34.91/0.9690 | 2.28 | 39.89/0.9900 | 35.61/0.9776 | 29.90/0.9292 | 24.78/0.8507 | 42.8 | 37 |
| BMBC [40] | 35.01/0.9764 | 35.15/0.9689 | 2.04 | 39.90/0.9902 | 35.31/0.9774 | 29.33/0.9270 | 23.92/0.8432 | 11 | 822 |
| IFRNet [41] | 35.80/0.9794 | 35.29/0.9693 | - | 40.03/0.9905 | 35.94/0.9793 | 30.41/0.9358 | 25.05/0.8587 | 5 | 22 |
| UPR-Net [42] | 36.03/0.9801 | 35.41/0.9698 | - | 40.37/0.9910 | 36.16/0.9797 | 30.67/0.9365 | 25.49/0.8627 | 1.7 | 42 |
| UPR-Net-large [42] | 36.28/0.9810 | 35.43/0.9700 | - | 40.42/0.9911 | 36.24/0.9799 | 30.81/0.9370 | 25.58/0.8636 | 3.7 | 62 |
| EBME-H* [43] | 36.19/0.9807 | 35.41/0.9697 | - | 40.28/0.9910 | 36.07/0.9797 | 30.64/0.9368 | 25.40/0.8634 | 3.9 | 82 |
| Ours-small | 36.05/0.9796 | 35.23/0.9695 | 1.95 | 39.91/0.9906 | 35.87/0.9794 | 30.68/0.9373 | 25.46/0.8629 | 8.7 | 36 |
| Ours | 36.37/0.9811 | 35.44/0.9700 | 1.91 | 40.22/0.9908 | 36.25/0.9801 | 30.85/0.9375 | 25.62/0.8639 | 17.2 | 82 |
| Method | HD (720p), 4× | HD (1080p), 4× | XTest-2K, 8× | XTest-4K, 8× |
|---|---|---|---|---|
| DAIN [5] | 30.25 | - | 29.33 | 26.78 |
| RIFEm [6] | 31.87 | 34.25 | 31.43 | 30.58 |
| ABME [15] | 31.43 | 33.22 | 30.65 | 30.16 |
| IFRNet [41] | 31.85 | 33.19 | 31.53 | 30.46 |
| M2M [44] | 31.94 | 33.45 | 32.13 | 30.88 |
| EMA-VFI-small [45] | 32.17 | 34.65 | 31.89 | 30.89 |
| EMA-VFI [45] | 32.38 | 35.28 | 32.85 | 31.46 |
| Ours | 32.21 | 34.85 | 32.25 | 31.09 |
| Architecture | Channel | Vimeo90K PSNR | Vimeo90K SSIM | UCF101 PSNR | UCF101 SSIM |
|---|---|---|---|---|---|
| 1-1-1-1 | 16 | 35.46 | 0.9774 | 34.60 | 0.9626 |
| 2-2-2-2 | 16 | 36.05 | 0.9796 | 35.23 | 0.9695 |
| 1-1-1-1 | 32 | 36.11 | 0.9801 | 35.31 | 0.9698 |
| 2-2-2-2 | 32 | 36.37 | 0.9811 | 35.44 | 0.9700 |
| Setting | Vimeo90K PSNR | Vimeo90K SSIM | UCF101 PSNR | UCF101 SSIM |
|---|---|---|---|---|
| PSTAT Layer Structure Design | | | | |
| 1 TRB w/Conv 3 × 3 | 34.21 | 0.9723 | 34.07 | 0.9527 |
| 1 TRB w/1 PSTA block | 35.59 | 0.9794 | 34.85 | 0.9633 |
| 1 TRB w/2 PSTA blocks | 36.11 | 0.9801 | 35.31 | 0.9698 |
| 2 TRB w/Conv 3 × 3 | 35.34 | 0.9692 | 34.41 | 0.9628 |
| 2 TRB w/1 PSTA block | 36.30 | 0.9811 | 35.34 | 0.9699 |
| 2 TRB w/2 PSTA blocks | 36.37 | 0.9811 | 35.44 | 0.9700 |
| PSTA Design | | | | |
| Ours w/o SA | 35.61 | 0.9779 | 34.90 | 0.9635 |
| Ours w/o TA | 36.19 | 0.9802 | 35.32 | 0.9698 |
| SA w/o Attention | 35.64 | 0.9782 | 35.05 | 0.9686 |
| SA w/o Conv | 35.94 | 0.9801 | 35.26 | 0.9691 |
| Model Architecture Design | | | | |
| Ours w/o MPFS-Net + CE-Net | 35.58 | 0.9757 | 34.82 | 0.9633 |
| Ours w/o CE-Net | 36.12 | 0.9799 | 35.33 | 0.9698 |
| Ours w/o MPFS-Net | 35.93 | 0.9791 | 35.24 | 0.9690 |
| CE-Net Conv Scheme (Model w/1 TRB + 1 PSTA block) | | | | |
| CE-Net - Conv 3 × 3 | 35.52 | 0.9792 | 34.68 | 0.9629 |
| CE-Net - Conv 1 × 1 | 35.59 | 0.9794 | 34.85 | 0.9633 |