Feature Maps Need More Attention: A Spatial-Channel Mutual Attention-Guided Transformer Network for Face Super-Resolution
Abstract
1. Introduction
- We devise a spatial-channel mutual attention-guided transformer network for face super-resolution. To the best of our knowledge, this is the first paper to explore the potential of inner feature maps in reconstructing plausible face images in the transformer-based FSR area.
- We carefully design a Spatial-Channel Mutual Attention-guided Transformer Module (SCATM) that extracts more detailed features by strengthening the relationships among inner feature maps (a minimal sketch of the underlying idea follows this list). Thanks to its powerful modeling ability, both local facial details and global facial structures can be fully explored and utilized.
- We propose an elaborately designed Multi-scale Feature Fusion Module (MFFM) that fuses multi-scale features during the reconstruction process. This module is crucial for acquiring a wide range of features, which in turn improves restoration quality.
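The paper's exact SCATM internals are given in Section 2.3; purely as a hedged illustration of the core idea — spatial attention and channel attention computed on the same feature map and fused to mutually re-weight it — a minimal PyTorch sketch follows. All module names, kernel sizes, and the reduction ratio here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialChannelMutualAttention(nn.Module):
    """Minimal sketch: attend to a feature map along both the channel
    and spatial axes, then fuse the two attended views with a residual.
    This approximates the general idea only; it is not the authors' SCAGB."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel branch: global average pooling -> bottleneck -> sigmoid.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: mean/max channel statistics -> 7x7 conv -> sigmoid.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c_map = self.channel_att(x)                        # (B, C, 1, 1)
        s_in = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)
        s_map = self.spatial_att(s_in)                     # (B, 1, H, W)
        x_c, x_s = x * c_map, x * s_map                    # two attended views
        return self.fuse(torch.cat([x_c, x_s], dim=1)) + x
```

In the actual method, such guiding blocks are paired with transformer blocks inside SCATM; the sketch conveys only the mutual spatial-channel re-weighting.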
2. Proposed Method
2.1. Revisiting Feature Maps and Feature Spaces
2.2. Overview
2.3. Spatial-Channel Mutual Attention-Guided Transformer Module (SCATM)
2.3.1. Spatial-Channel Mutual Attention Guiding Block-Top (SCAGB-T)
2.3.2. Spatial-Channel Mutual Attention Guiding Block-Bottom (SCAGB-B)
2.3.3. Channel-Wise Multi-Head Transformer Block (CMTB)
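The reference list includes Restormer [57], whose transposed attention computes self-attention across channels rather than spatial positions, so the cost scales with the channel count instead of the image size. A hedged PyTorch sketch in that spirit follows; it is not the authors' exact CMTB, and the head count and layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseMultiHeadAttention(nn.Module):
    """Sketch of channel-wise (transposed) multi-head self-attention,
    in the spirit of Restormer [57]; not the authors' exact CMTB."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Reshape to (B, heads, C/heads, H*W): the tokens are CHANNELS,
        # so the attention matrix is (C/heads x C/heads), not (HW x HW).
        shape = (b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = q.reshape(shape), k.reshape(shape), v.reshape(shape)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out) + x  # residual connection
```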
2.4. Multi-Scale Feature Fusion Module (MFFM)
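Again as a hedged illustration only: a common pattern for multi-scale fusion — the kind of learned fusion the ablation in Section 3.3 compares against plain addition or concatenation — resizes features from several stages to one resolution and merges them with learned convolutions. All names and channel counts below are assumptions, not the authors' MFFM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureFusion(nn.Module):
    """Illustrative multi-scale fusion: resize features from several
    stages to a common resolution, concatenate, and reduce with 1x1 conv.
    The real MFFM may differ; this is a sketch of the general pattern."""

    def __init__(self, channels: int, num_scales: int = 3):
        super().__init__()
        self.reduce = nn.Conv2d(num_scales * channels, channels, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feats: list) -> torch.Tensor:
        # Bring every scale to the spatial size of the first (finest) map.
        target = feats[0].shape[-2:]
        aligned = [f if f.shape[-2:] == target
                   else F.interpolate(f, size=target, mode="bilinear",
                                      align_corners=False)
                   for f in feats]
        fused = self.reduce(torch.cat(aligned, dim=1))
        return self.refine(fused) + feats[0]  # residual keeps fine details
```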
3. Experiments
3.1. Dataset and Metrics
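The paper evaluates with PSNR, SSIM [58], and LPIPS [60]. As a minimal sketch (the authors' exact evaluation protocol — color space, border cropping — is not reproduced here), PSNR and LPIPS can be computed as follows; the `lpips` package is the reference implementation of [60]:

```python
import torch
import lpips  # pip install lpips

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()

# LPIPS (lower is better) measures perceptual distance using pretrained
# network features. Inputs are (B, 3, H, W) tensors scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")
# distance = lpips_fn(sr, hr)
```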
3.2. Implementation Details
3.3. Ablation Studies
- (a) Removing both components from SCATM-T (i.e., removing SCATM-T entirely) degrades performance dramatically. Without SCATM-T, the model is much shallower, making it difficult to refine features; moreover, the Multi-scale Feature Fusion Module (MFFM) loses the fine-detailed feature maps that SCATM-T would otherwise supply, which further hurts reconstruction.
- (b) With a single component enabled, SCATM-T outperforms the no-component baseline above, demonstrating that both SCAGB-T and CMTB enhance the representation ability of the model. However, SCAGB-T alone has no transformer features to guide, while CMTB alone cannot focus on the feature-map regions that are rich in information, so either configuration on its own is limited.
- (c) Equipped with both carefully designed components, SCAGB-T and CMTB, SCATM-T achieves the best performance on all evaluation metrics, showing that the two components are complementary and jointly promote both local facial details and global facial structures.
3.4. Comparison with the State-of-the-Art
3.5. JPEG Artifacts Analysis
3.6. Model Complexity Analysis
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Baker, S.; Kanade, T. Hallucinating faces. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 28–30 March 2000; pp. 83–88. [Google Scholar]
- Jiang, J.J.; Wang, C.Y.; Liu, X.M.; Ma, J.Y. Deep learning-based face super-resolution: A survey. ACM Comput. Surv. 2023, 55, 1–36. [Google Scholar] [CrossRef]
- Zhang, L.; Wu, X. An edge-guided image interpolation algorithm via directional filtering and data fusion. IEEE Trans. Image Process. 2006, 15, 2226–2238. [Google Scholar] [CrossRef]
- Chakrabarti, A.; Rajagopalan, A.N.; Chellappa, R. Super-resolution of face images using kernel pca-based prior. IEEE Trans. Multimed. 2007, 9, 888–892. [Google Scholar] [CrossRef]
- Jung, C.K.; Jiao, L.C.; Liu, B.; Gong, M.G. Position-patch based face hallucination using convex optimization. IEEE Signal Process. Lett. 2011, 18, 367–370. [Google Scholar] [CrossRef]
- Tappen, M.F.; Liu, C. A bayesian approach to alignment-based image hallucination. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 236–249. [Google Scholar]
- Zhang, K.B.; Gao, X.B.; Tao, D.C.; Li, X.L. Single Image Super-Resolution with Non-Local Means and Steering Kernel Regression. IEEE Trans. Image Process. 2012, 21, 4544–4556. [Google Scholar] [CrossRef]
- Jiang, J.J.; Hu, R.M.; Wang, Z.Y.; Han, Z. Face Super-Resolution via Multilayer Locality-Constrained Iterative Neighbor Embedding and Intermediate Dictionary Learning. IEEE Trans. Image Process. 2014, 23, 4220–4231. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, J.; Hoi, S.C. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Pei, Z.; Zeng, T. From beginner to master: A survey for deep learning-based single-image super-resolution. arXiv 2021, arXiv:2109.14335. [Google Scholar]
- Zhou, E.J.; Fan, H.Q.; Cao, Z.M.; Jiang, Y.N.; Yin, Q. Learning face hallucination in the wild. In Proceedings of the Association for the Advancement of Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 3871–3877. [Google Scholar]
- Cao, Q.X.; Lin, L.; Shi, Y.K.; Liang, X.D.; Li, G.B. Attention-aware face hallucination via deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 690–698. [Google Scholar]
- Zhang, K.; Zhang, Z.; Cheng, C.W.; Hsu, W.H.; Qiao, Y.; Liu, W.; Zhang, T. Super-identity convolutional neural network for face hallucination. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 183–198. [Google Scholar]
- Huang, H.B.; He, R.; Sun, Z.N.; Tan, T.N. Wavelet domain generative adversarial network for multiscale face hallucination. Int. J. Comput. Vis. 2019, 127, 763–784. [Google Scholar] [CrossRef]
- Wang, C.; Jiang, J.; Zhong, Z.; Liu, X. Spatial-Frequency Mutual Learning for Face Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–24 June 2023; pp. 22356–22366. [Google Scholar]
- Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.H.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 105–114. [Google Scholar]
- Yang, T.; Ren, P.; Xie, X.; Zhang, L. Gan prior embedded network for blind face restoration in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 672–681. [Google Scholar]
- Zhang, Y.; Yu, X.; Lu, X.; Liu, P. Pro-uigan: Progressive face hallucination from occluded thumbnails. IEEE Trans. Image Process. 2022, 31, 3236–3250. [Google Scholar] [CrossRef]
- Yang, L.B.; Liu, C.; Wang, P.; Wang, S.S.; Ren, P.R.; Ma, S.W.; Gao, W. Hifacegan: Face renovation via collaborative suppression and replenishment. In Proceedings of the ACM International Conference on Multimedia, Dublin, Ireland, 8–11 June 2020; pp. 1551–1560. [Google Scholar]
- Gao, J.; Tang, N.; Zhang, D. A Multi-Scale Deep Back-Projection Backbone for Face Super-Resolution with Diffusion Models. Appl. Sci. 2023, 13, 8110. [Google Scholar] [CrossRef]
- Dou, H.; Chen, C.; Hu, X.Y.; Xuan, Z.X.; Hu, Z.S.; Peng, S.L. Pca-srgan: Incremental orthogonal projection discrimination for face super-resolution. In Proceedings of the ACM International Conference on Multimedia, Dublin, Ireland, 8–11 June 2020; pp. 1891–1899. [Google Scholar]
- Zhang, M.L.; Ling, Q. Supervised pixel-wise GAN for face super-resolution. IEEE Trans. Multimed. 2021, 23, 1938–1950. [Google Scholar] [CrossRef]
- Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2492–2501. [Google Scholar]
- Bulat, A.; Tzimiropoulos, G. Super-fan: Integrated facial landmark localization and super-resolution of real-world low-resolution faces in arbitrary poses with GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 109–117. [Google Scholar]
- Hu, X.; Ren, W.; LaMaster, J.; Cao, X.; Li, X.; Li, Z.; Menze, B.; Liu, W. Face super-resolution guided by 3d facial priors. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 763–780. [Google Scholar]
- Ma, C.; Jiang, Z.; Rao, Y.; Lu, J.; Zhou, J. Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5569–5578. [Google Scholar]
- Wang, Z.; Zhang, J.; Chen, R.; Wang, W.; Luo, P. Restoreformer: High-quality blind face restoration from undegraded key-value pairs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 512–521. [Google Scholar]
- Wang, Z.D.; Cun, X.D.; Bao, J.M.; Zhou, W.G.; Liu, J.Z.; Li, H.Q. Uformer: A General U-Shaped Transformer for Image Restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
- Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
- Li, Z.Z.; Li, G.; Li, T.; Liu, S.; Gao, W. Information-Growth Attention Network for Image Super-Resolution. In Proceedings of the ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 544–552. [Google Scholar]
- Li, C.; Xiao, N. A Face Structure Attention Network for Face Super-Resolution. In Proceedings of the International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 75–81. [Google Scholar]
- Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.-Y.K. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 2020, 30, 1219–1231. [Google Scholar] [CrossRef]
- Lu, T.; Wang, Y.; Zhang, Y.; Wang, Y.; Wei, L.; Wang, Z.; Jiang, J. Face hallucination via split-attention in split-attention network. In Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 5501–5509. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Xiong, L.; Zhang, J.; Zheng, X.; Wang, Y. Context Transformer and Adaptive Method with Visual Transformer for Robust Facial Expression Recognition. Appl. Sci. 2024, 14, 1535. [Google Scholar] [CrossRef]
- Shi, A.; Ding, H. Underwater Image Super-Resolution via Dual-aware Integrated Network. Appl. Sci. 2023, 13, 12985. [Google Scholar] [CrossRef]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-loop matters: Dual regression networks for single image superresolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 5407–5416. [Google Scholar]
- Gao, G.; Xu, Z.; Li, J.; Yang, J.; Zeng, T.; Qi, G.J. CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution. IEEE Trans. Image Process. 2023, 32, 1978–1991. [Google Scholar] [CrossRef]
- Yang, D.; Wei, Y.; Hu, C.; Yu, X.; Sun, C.; Wu, S.; Zhang, J. Multi-Scale Feature Fusion and Structure-Preserving Network for Face Super-Resolution. Appl. Sci. 2023, 13, 8928. [Google Scholar] [CrossRef]
- Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 5835–5843. [Google Scholar]
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426. [Google Scholar]
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
- Le, V.; Brandt, J.; Lin, Z.; Bourdev, L.; Huang, T.S. Interactive facial feature localization. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 679–692. [Google Scholar]
- Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
- Zhang, Z.; Qi, C.; Asif, M.R. Investigation on Projection Space Pairs in Neighbor Embedding Algorithms. In Proceedings of the IEEE International Conference on Signal Processing, Beijing, China, 12–16 August 2018; pp. 125–128. [Google Scholar]
- Hao, Y.H.; Qi, C. Face Hallucination Based on Modified Neighbor Embedding and Global Smoothness Constraint. IEEE Signal Process. Lett. 2014, 21, 1187–1191. [Google Scholar] [CrossRef]
- Tu, Q.; Li, J.W.; Javaria, I. Locality constraint neighbor embedding via KPCA and optimized reference patch for face hallucination. In Proceedings of the IEEE International Conference on Image Processing, Phoenix, AZ, USA, 25–28 September 2016; pp. 424–428. [Google Scholar]
- Yang, W.; Xia, S.; Liu, J.; Guo, Z. Reference-Guided Deep Super-Resolution via Manifold Localized External Compensation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 1270–1283. [Google Scholar] [CrossRef]
- Menon, S.; Damian, A.; Hu, S.; Ravi, N.; Rudin, C. PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2434–2442. [Google Scholar]
- Chen, L.; Pan, J.; Jiang, J.; Zhang, J.; Han, Z.; Bao, L. Multi-Stage Degradation Homogenization for Super-Resolution of Face Images With Extreme Degradations. IEEE Trans. Image Process. 2021, 30, 5600–5612. [Google Scholar] [CrossRef]
- Howard, J.; Gugger, S. Deep Learning from Scratch. In Deep Learning for Coders with Fastai and PyTorch; Faucher, C., Hassell, J., Potter, M., Eds.; O’Reilly Media: Sebastopol, CA, USA, 2020; pp. 493–515. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 510–519. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5729. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
- Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 15, 430–444. [Google Scholar] [CrossRef] [PubMed]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.M.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4–9. [Google Scholar]
- JPEG Artifact Generator. Available online: https://impliedchaos.github.io/artifactor.html (accessed on 1 May 2024).
Ablation of SCAGB-T and CMTB within SCATM-T:

| SCAGB-T | CMTB | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| × | × | 27.30 | 0.7833 | 0.2047 |
| × | ✓ | 27.55 | 0.7897 | 0.1831 |
| ✓ | × | 27.54 | 0.7891 | 0.1839 |
| ✓ | ✓ | 27.63 | 0.7905 | 0.1797 |
Ablation of SCAGB-B and CMTB within SCATM-B:

| SCAGB-B | CMTB | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| × | × | 27.50 | 0.7877 | 0.1897 |
| × | ✓ | 27.60 | 0.7902 | 0.1811 |
| ✓ | × | 27.59 | 0.7899 | 0.1817 |
| ✓ | ✓ | 27.63 | 0.7905 | 0.1797 |
Effect of the number of SCATM-B modules:

| SCATM-B Number | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 0 | 27.50 | 0.7877 | 0.1897 |
| 2 | 27.57 | 0.7900 | 0.1823 |
| 4 | 27.63 | 0.7905 | 0.1797 |
| 6 | 27.62 | 0.7903 | 0.1804 |
Comparison of feature-fusion approaches:

| Approaches | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Not Applied | 27.54 | 0.7885 | 0.1873 |
| Only Add | 27.58 | 0.7892 | 0.1838 |
| Only Concat | 27.59 | 0.7895 | 0.1832 |
| Our MFFM | 27.63 | 0.7905 | 0.1797 |
Comparison with state-of-the-art methods on the CelebA and Helen test sets:

| Methods | CelebA PSNR↑ | CelebA SSIM↑ | CelebA LPIPS↓ | Helen PSNR↑ | Helen SSIM↑ | Helen LPIPS↓ |
|---|---|---|---|---|---|---|
| Bicubic | 23.44 | 0.6180 | 0.5900 | 23.79 | 0.6739 | 0.5254 |
| SRResNet [16] | 26.08 | 0.7502 | 0.2131 | 25.47 | 0.7828 | 0.2308 |
| IGAN [30] | 26.99 | 0.7801 | 0.2201 | 26.37 | 0.7996 | 0.2245 |
| RCAN [29] | 26.99 | 0.7796 | 0.2249 | 26.39 | 0.7965 | 0.2359 |
| SISN [33] | 26.85 | 0.7738 | 0.2337 | 26.33 | 0.7974 | 0.2322 |
| SPARNet [32] | 26.95 | 0.7794 | 0.2211 | 26.38 | 0.7953 | 0.2314 |
| SwinIR [37] | 27.15 | 0.7850 | 0.2162 | 26.48 | 0.7917 | 0.2413 |
| Uformer [28] | 27.33 | 0.7884 | 0.2040 | 26.67 | 0.8009 | 0.2063 |
| Ours | 27.63 | 0.7905 | 0.1797 | 26.97 | 0.8069 | 0.1945 |