LightVSR: A Lightweight Video Super-Resolution Model with Multi-Scale Feature Aggregation
Abstract
1. Introduction
- (1) We propose a lightweight VSR model that uses bidirectional optical flow for motion compensation, enabling efficient video frame reconstruction even on low-power devices.
- (2) We design head-tail convolution, a novel convolutional operation that significantly reduces redundant computation while still representing critical features across the entire feature tensor.
- (3) We design a multi-input attention mechanism that uses channel-wise attention to fuse features from multiple sources comprehensively and efficiently.
- (4) Combining head-tail convolution with multi-input attention, we further design a feature aggregation module that leverages cross-layer shortcut connections to improve computational efficiency while maintaining model performance.
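To make contribution (3) concrete, the following is a minimal sketch of channel-wise attention over several input feature maps. It is an illustration under stated assumptions, not the paper's module: the weighting rule (a softmax over globally pooled channel descriptors, with no learned layers), the function names, and the toy inputs are all hypothetical. Plain Python lists keep the sketch dependency-free; a practical implementation would use a tensor library.

```python
import math

def global_avg_pool(feature):
    """Reduce a feature map (list of channels, each a flat list of pixels) to one mean per channel."""
    return [sum(channel) / len(channel) for channel in feature]

def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_with_channel_attention(inputs):
    """Fuse K feature maps of identical shape into one feature map.

    Assumption (not taken from the paper): for each channel, the K inputs are
    weighted by a softmax over their pooled descriptors; no learned layers are used.
    """
    descriptors = [global_avg_pool(f) for f in inputs]  # K x C channel descriptors
    num_channels = len(inputs[0])
    fused = []
    for c in range(num_channels):
        weights = softmax([d[c] for d in descriptors])  # K weights that sum to 1
        num_pixels = len(inputs[0][c])
        fused.append([
            sum(w * f[c][p] for w, f in zip(weights, inputs))
            for p in range(num_pixels)
        ])
    return fused

# Two hypothetical feature maps with 2 channels and 4 pixels each,
# standing in for features from two different sources.
feat_a = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
feat_b = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
fused = fuse_with_channel_attention([feat_a, feat_b])
```

In this toy case the first channel is identical in both inputs, so it passes through unchanged, while the second channel is pulled toward the input with the stronger response. Unlike plain addition or concatenation, the per-channel weights depend on the inputs themselves.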
2. Related Work
3. Feature Aggregation Module
3.1. Head-Tail Convolution
3.2. Multi-Input Attention
3.3. Feature Aggregation
4. Results
4.1. Experiment
4.2. Ablation Study on HT-Convolution
4.3. Comparison of Feature Fusion Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Jiang, Y.; Nawała, J.; Feng, C.; Zhang, F.; Zhu, X.; Sole, J.; Bull, D. RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content. arXiv 2024, arXiv:2411.13362.
- [2] Sun, J.; Yuan, Q.; Shen, H.; Li, J.; Zhang, L. A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method. Sensors 2024, 24, 5566.
- [3] Ko, H.-k.; Park, D.; Park, Y.; Lee, B.; Han, J.; Park, E. Sequence Matters: Harnessing Video Models in Super-Resolution. arXiv 2024, arXiv:2412.11525.
- [4] Li, Y.; Yang, X.; Liu, W.; Jin, X.; Jia, X.; Lai, Y.; Liu, H.; Rosin, P.L.; Zhou, W. TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment. arXiv 2024, arXiv:2412.18933.
- [5] Wen, Y.; Zhao, Y.; Liu, Y.; Jia, F.; Wang, Y.; Luo, C.; Zhang, C.; Wang, T.; Sun, X.; Zhang, X. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6902–6912.
- [6] Feng, R.; Li, C.; Loy, C.C. Kalman-inspired feature propagation for video face super-resolution. In Proceedings of the European Conference on Computer Vision, Paris, France, 26–27 March 2025; pp. 202–218.
- [7] Wang, Q.; Yin, Q.; Huang, Z.; Jiang, W.; Su, Y.; Ma, S.; Zhang, J. Compressed Domain Prior-Guided Video Super-Resolution for Cloud Gaming Content. arXiv 2025, arXiv:2501.01773.
- [8] Ranjan, A.; Black, M.J. Optical Flow Estimation using a Spatial Pyramid Network. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2720–2729.
- [9] Shi, W.Z.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z.H. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
- [10] Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160.
- [11] Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
- [12] Wang, X.; Chan, K.C.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
- [13] Li, W.; Tao, X.; Guo, T.; Qi, L.; Lu, J.; Jia, J. MuCAN: Multi-correspondence aggregation network for video super-resolution. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16. pp. 335–351.
- [14] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
- [15] Sajjadi, M.S.; Vemulapalli, R.; Brown, M. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6626–6634.
- [16] Chu, M.; Xie, Y.; Leal-Taixé, L.; Thuerey, N. Temporally coherent GANs for video super-resolution (TecoGAN). arXiv 2018, arXiv:1811.09393.
- [17] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 27.
- [18] Chan, K.C.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. BasicVSR: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4947–4956.
- [19] Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844.
- [20] Rota, C.; Buzzelli, M.; van de Weijer, J. Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models. arXiv 2023, arXiv:2311.15908.
- [21] Xu, K.; Yu, Z.; Wang, X.; Mi, M.B.; Yao, A. Enhancing Video Super-Resolution via Implicit Resampling-based Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2546–2555.
- [22] Lu, Y.; Wang, Z.; Liu, M.; Wang, H.; Wang, L. Learning spatial-temporal implicit neural representations for event-guided video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1557–1567.
- [23] Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Lu, T.; Tian, X.; Ma, J. Omniscient video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4429–4438.
- [24] Jiang, L.; Wang, N.; Dang, Q.; Liu, R.; Lai, B. PP-MSVSR: Multi-stage video super-resolution. arXiv 2021, arXiv:2112.02828.
- [25] Wang, X.; Yang, X.; Li, H.; Li, T. FDDCC-VSR: A lightweight video super-resolution network based on deformable 3D convolution and cheap convolution. Vis. Comput. 2024, 1–13.
- [26] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- [27] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- [28] Nah, S.; Baik, S.; Hong, S.; Moon, G.; Son, S.; Timofte, R.; Lee, K.M. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
- [29] Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125.
- [30] Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of the 1st International Conference on Image Processing, Austin, TX, USA, 13–16 November 1994; pp. 168–172.
- [31] Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4778–4787.
- [32] Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4472–4480.
- [33] Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3224–3232.
- [34] Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3897–3906.
- [35] Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3106–3115.
- [36] Isobe, T.; Li, S.; Jia, X.; Yuan, S.; Slabaugh, G.; Xu, C.; Li, Y.-L.; Wang, S.; Tian, Q. Video super-resolution with temporal group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8008–8017.
- [37] Fuoli, D.; Gu, S.; Timofte, R. Efficient video super-resolution through recurrent latent space propagation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3476–3485.
- [38] Isobe, T.; Jia, X.; Gu, S.; Li, S.; Wang, S.; Tian, Q. Video super-resolution with recurrent structure-detail network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16. pp. 645–660.
- [39] Isobe, T.; Zhu, F.; Jia, X.; Wang, S. Revisiting temporal modeling for video super-resolution. arXiv 2020, arXiv:2008.05765.
- [40] Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3D convolution for video super-resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504.
- [41] Geng, Z.; Liang, L.; Ding, T.; Zharkov, I. RSTT: Real-time spatial temporal transformer for space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17441–17451.
- [42] Wang, H.; Xiang, X.; Tian, Y.; Yang, W.; Liao, Q. STDAN: Deformable attention network for space-time video super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10616.
Method | Params (M) | Runtime (ms) | FPS | REDS4 (BI), PSNR/SSIM | Vimeo-90K (BI), PSNR/SSIM | Vid4 (BI), PSNR/SSIM | UDM10 (BD), PSNR/SSIM | Vimeo-90K (BD), PSNR/SSIM | Vid4 (BD), PSNR/SSIM |
---|---|---|---|---|---|---|---|---|---|
(1981) Bicubic [10] | - | - | - | 26.14/0.7292 | 31.32/0.8684 | 23.78/0.6347 | 28.47/0.8253 | 31.30/0.8687 | 21.80/0.5246 |
(2017) VESPCN [31] | - | - | - | - | - | 25.35/0.7557 | - | - | - |
(2017) SPMC [32] | - | - | - | - | - | 25.88/0.7752 | - | - | - |
(2019) TOFlow [29] | - | - | - | 27.98/0.7990 | 33.08/0.9054 | 25.89/0.7651 | 36.26/0.9438 | 34.62/0.9212 | - |
(2018) FRVSR [15] | 5.1 | 137 | 7.30 | - | - | - | 37.09/0.9522 | 35.64/0.9319 | 26.69/0.8103 |
(2018) DUF [33] | 5.8 | 974 | 1.03 | 28.63/0.8251 | - | - | 38.48/0.9605 | 36.87/0.9447 | 27.38/0.8329 |
(2019) RBPN [34] | 12.2 | 1507 | 0.66 | 30.09/0.8590 | 37.07/0.9435 | 27.12/0.8180 | 38.66/0.9596 | 37.20/0.9458 | |
(2019) EDVR-M [12] | 3.3 | 118 | 8.47 | 30.53/0.8699 | 37.09/0.9446 | 27.10/0.8186 | 39.40/0.9663 | 37.33/0.9484 | 27.45/0.8406 |
(2019) EDVR [12] | 20.6 | 378 | 2.65 | 31.09/0.8800 | 37.61/0.9489 | 27.35/0.8264 | 39.89/0.9686 | 37.81/0.9523 | 27.85/0.8503 |
(2019) PFNL [35] | 3.0 | 295 | 3.39 | 29.63/0.8502 | 36.14/0.9363 | 26.73/0.8029 | 38.74/0.9627 | - | 27.16/0.8355 |
(2020) MuCAN [13] | - | - | - | 30.88/0.8750 | 37.32/0.9465 | - | - | - | - |
(2020) TGA [36] | 5.8 | - | - | - | - | - | - | 37.59/0.9516 | 27.63/0.8423 |
(2019) RLSP [37] | 4.2 | 49 | 20.41 | - | - | - | 38.48/0.9606 | 36.49/0.9403 | 27.48/0.8388 |
(2020) RSDN [38] | 6.2 | 94 | 10.64 | - | - | - | 39.35/0.9653 | 37.23/0.9471 | 27.92/0.8505 |
(2020) RRN [39] | 3.4 | 45 | 22.22 | - | - | - | 38.96/0.9644 | - | 27.69/0.8488 |
(2020) D3Dnet [40] | 2.6 | - | - | 30.51/0.8657 | 35.65/0.9331 | 26.52/0.7993 | - | - | - |
(2022) RSTT [41] | 4.5 | 38 | 26.32 | 30.11/0.8613 | 36.58/0.9381 | 26.29/0.7941 | - | - | - |
(2023) STDAN [42] | 8.3 | 72 | 13.89 | 29.98/0.8613 | 35.70/0.9387 | 26.28/0.8041 | - | - | - |
(2024) FDDCC-VSR [25] | 1.8 | - | - | 30.55/0.8663 | 35.73/0.9357 | 26.79/0.8334 | - | - | - |
LightVSR (ours) | 3.5 | 35 | 28.57 | 30.71/0.8780 | 36.69/0.9406 | 26.95/0.8188 | 39.25/0.9656 | 36.91/0.9444 | 27.37/0.8360 |
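The Runtime and Fps columns above appear to be related by FPS = 1000 / runtime, which suggests that runtime is reported in milliseconds per frame. The short check below is a sketch that reproduces the reported FPS for a few rows copied from the table; the function name is illustrative.

```python
# (runtime, reported FPS) pairs copied from rows of the comparison table.
reported = {
    "FRVSR": (137, 7.30),
    "EDVR": (378, 2.65),
    "RRN": (45, 22.22),
    "LightVSR": (35, 28.57),
}

def fps_from_runtime_ms(runtime_ms):
    """Frames per second implied by a per-frame runtime in milliseconds."""
    return 1000.0 / runtime_ms

for name, (runtime_ms, fps) in reported.items():
    derived = fps_from_runtime_ms(runtime_ms)
    print(f"{name}: {derived:.2f} fps derived, {fps:.2f} fps reported")
```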
Setting | Params | REDS4 PSNR ↑ | REDS4 SSIM ↑ | REDS4 MAE ↓ | REDS4 MSE ↓ | Vimeo-90K-T PSNR ↑ | Vimeo-90K-T SSIM ↑ | Vimeo-90K-T MAE ↓ | Vimeo-90K-T MSE ↓ | Vid4 PSNR ↑ | Vid4 SSIM ↑ | Vid4 MAE ↓ | Vid4 MSE ↓ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HT-Conv Applied | 3.55 M | 30.72 | 0.8784 | 0.0176 | 0.0009 | 35.99 | 0.9366 | 0.0126 | 0.0007 | 26.84 | 0.8121 | 0.0355 | 0.0035 |
HT-Conv Not Applied | 3.41 M | 30.13 | 0.8640 | 0.0188 | 0.0011 | 35.52 | 0.9309 | 0.0133 | 0.0008 | 26.29 | 0.7902 | 0.0377 | 0.0040 |
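For reference, the pixel-wise metrics in the ablation tables can be sketched as follows. This is a minimal illustration, assuming pixel values normalized to [0, 1] (consistent with the magnitude of the MAE and MSE figures); the paper's exact evaluation protocol, such as the color channel used, border cropping, and per-frame averaging, is not stated here, and SSIM is omitted. The function names and toy values are hypothetical.

```python
import math

def mae(pred, target):
    """Mean absolute error between two equal-length pixel sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value (assumed 1.0)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / err)

# Toy example: every pixel is off by 0.1 on a [0, 1] scale.
target = [0.5, 0.5, 0.5, 0.5]
pred = [0.6, 0.4, 0.6, 0.4]
print(mae(pred, target), mse(pred, target), psnr(pred, target))
```

PSNR computed from a dataset-averaged MSE need not match an average of per-frame PSNR values, and the tabulated MSE is rounded to four decimals, so converting the tabulated MSE directly will not necessarily reproduce the tabulated PSNR.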
Fusion method | REDS4 PSNR ↑ | REDS4 SSIM ↑ | REDS4 MAE ↓ | REDS4 MSE ↓ | Vimeo-90K-T PSNR ↑ | Vimeo-90K-T SSIM ↑ | Vimeo-90K-T MAE ↓ | Vimeo-90K-T MSE ↓ | Vid4 PSNR ↑ | Vid4 SSIM ↑ | Vid4 MAE ↓ | Vid4 MSE ↓ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Attention mechanism | 30.71 | 0.8780 | 0.0176 | 0.0009 | 36.00 | 0.9368 | 0.0125 | 0.0007 | 26.95 | 0.8188 | 0.0349 | 0.0034 |
Concatenate | 30.65 | 0.8769 | 0.0177 | 0.0010 | 36.03 | 0.9369 | 0.0147 | 0.0007 | 26.93 | 0.8188 | 0.0351 | 0.0034 |
Tensor addition | 30.61 | 0.8764 | 0.0178 | 0.0010 | 35.93 | 0.9359 | 0.0126 | 0.0007 | 26.80 | 0.8143 | 0.0354 | 0.0035 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Huang, G.; Li, N.; Liu, J.; Zhang, M.; Zhang, L.; Li, J. LightVSR: A Lightweight Video Super-Resolution Model with Multi-Scale Feature Aggregation. Appl. Sci. 2025, 15, 1506. https://doi.org/10.3390/app15031506