Transformer-Based Cascading Reconstruction Network for Video Snapshot Compressive Imaging
Abstract
1. Introduction
- (1)
- We propose a Transformer-based Cascading Reconstruction Network to learn multi-scale and long-range spatiotemporal features for video snapshot compressive imaging. The cascading architecture also improves the network’s ability to effectively represent the content of complex video scenes.
- (2)
- To better coordinate the two stages, we propose a dynamic fusion Transformer that adaptively fuses the features of both stages, so that the second stage can reconstruct more incremental details.
- (3)
- Experiments on simulated and real datasets show that the proposed network can improve the reconstruction quality significantly, and the ablation studies also verify the effectiveness of our network architecture.
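The sensing process that all of the compared reconstruction methods invert can be sketched as follows. This is a minimal NumPy illustration of the standard video SCI forward model (one coded snapshot accumulated from B mask-modulated frames); the sizes and random binary masks are illustrative, not the paper’s actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 8, 64, 64                       # compression ratio B: frames per snapshot
video = rng.random((B, H, W))             # toy video cube (ground truth frames)
masks = rng.integers(0, 2, (B, H, W))     # binary coding masks, one per frame

# Single coded snapshot: per-frame modulation followed by temporal summation
measurement = np.sum(masks * video, axis=0)   # shape (H, W)
```

The reconstruction task is then to recover the B frames of `video` from the single 2D `measurement` and the known `masks`.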
2. Related Works
2.1. Video Compressive Snapshot Imaging
2.2. Reconstruction Methods
2.3. Vision Transformer
3. Transformer-Based Cascading Reconstruction Network
3.1. Overall Structure Reconstruction
3.2. Measurement Residual Update
3.3. Incremental Details Reconstruction
4. Network Parameters Learning
Algorithm 1: Transformer-Based Cascading Reconstruction Network.
input: learning rate, batch size, maximal iteration number, regularization parameters, measurement matrix
output: network parameters
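Algorithm 1 follows the standard supervised training loop: sample a batch, simulate measurements with the measurement matrix, reconstruct, and update the network parameters by gradient descent on the reconstruction loss. A schematic sketch, with a linear toy model standing in for the network and all names and sizes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
lr, batch_size, max_iters = 1e-4, 4, 500   # learning rate, batch size, iteration budget
n = 16
W_net = np.zeros((n, n))                   # toy "network": a linear reconstruction operator
Phi = rng.random((n, n))                   # toy measurement matrix

losses = []
for it in range(max_iters):
    x = rng.random((batch_size, n))        # ground-truth training signals
    y = x @ Phi.T                          # simulated measurements
    x_hat = y @ W_net.T                    # reconstruction by the current network
    losses.append(np.mean((x_hat - x) ** 2))
    grad = (x_hat - x).T @ y / batch_size  # gradient of the squared error w.r.t. W_net (up to a constant)
    W_net -= lr * grad                     # gradient-descent parameter update
```

In the paper the linear operator is replaced by the cascading Transformer network and the update is performed with Adam [Kingma and Ba], but the loop structure is the same.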
5. Experiment Results and Analysis
5.1. Datasets and Experimental Setting
5.2. Results on Simulation Datasets
5.3. Results on Real Datasets
5.4. Ablation Studies
5.5. Time Complexity and Parameter Quantity
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
- Chen, S.S.; Donoho, D.L.; Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Rev. 2001, 43, 129–159. [Google Scholar] [CrossRef]
- Candès, E.J.; Romberg, J.K.; Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. J. Issued Courant Inst. Math. Sci. 2006, 59, 1207–1223. [Google Scholar] [CrossRef]
- Candès, E.J.; Romberg, J.K.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509. [Google Scholar] [CrossRef]
- Li, L.; Fang, Y.; Liu, L.; Peng, H.; Kurths, J.; Yang, Y. Overview of Compressed Sensing: Sensing Model, Reconstruction Algorithm, and Its Applications. Appl. Sci. 2020, 10, 5909. [Google Scholar] [CrossRef]
- Jalali, S.; Yuan, X. Snapshot compressed sensing: Performance bounds and algorithms. IEEE Trans. Inf. Theory 2019, 65, 8005–8024. [Google Scholar] [CrossRef]
- Llull, P.; Liao, X.; Yuan, X.; Yang, J.; Kittle, D.; Carin, L.; Sapiro, G.; Brady, D.J. Coded aperture compressive temporal imaging. Opt. Express 2013, 21, 10526–10545. [Google Scholar] [CrossRef]
- Liu, Y.; Yuan, X.; Suo, J.; Brady, D.J.; Dai, Q. Rank minimization for snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2990–3006. [Google Scholar] [CrossRef]
- Yang, J.; Liao, X.; Yuan, X.; Llull, P.; Brady, D.J.; Sapiro, G.; Carin, L. Compressive sensing by learning a Gaussian mixture model from measurements. IEEE Trans. Image Process. 2014, 24, 106–119. [Google Scholar] [CrossRef]
- Krahmer, F.; Kruschel, C.; Sandbichler, M. Total variation minimization in compressed sensing. In Compressed Sensing and Its Applications; Birkhäuser: Cham, Switzerland, 2017; pp. 333–358. [Google Scholar]
- Yang, J.; Yuan, X.; Liao, X.; Llull, P.; Brady, D.J.; Sapiro, G.; Carin, L. Video compressive sensing using Gaussian mixture models. IEEE Trans. Image Process. 2014, 23, 4863–4878. [Google Scholar] [CrossRef]
- Ma, J.; Liu, X.Y.; Shou, Z.; Yuan, X. Deep Tensor ADMM-Net for snapshot compressive imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 10223–10232. [Google Scholar]
- Yuan, X. Generalized alternating projection based total variation minimization for compressive sensing. In Proceedings of the International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2539–2543. [Google Scholar]
- Wei, Z.; Zhang, J.; Xu, Z.; Liu, Y. Optimization Methods of Compressively Sensed Image Reconstruction Based on Single-Pixel Imaging. Appl. Sci. 2020, 10, 3288. [Google Scholar] [CrossRef]
- Cheng, Z.; Chen, B.; Liu, G.; Zhang, H.; Lu, R.; Wang, Z.; Yuan, X. Memory-efficient network for large-scale video compressive sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16246–16255. [Google Scholar]
- Saideni, W.; Helbert, D.; Courreges, F.; Cances, J.P. An Overview on Deep Learning Techniques for Video Compressive Sensing. Appl. Sci. 2022, 12, 2734. [Google Scholar] [CrossRef]
- Yuan, X.; Liu, Y.; Suo, J.; Dai, Q. Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1447–1457. [Google Scholar]
- Qiao, M.; Meng, Z.; Ma, J.; Yuan, X. Deep learning for video compressive sensing. APL Photonics 2020, 5, 030801. [Google Scholar] [CrossRef]
- Sun, Y.; Chen, X.; Kankanhalli, M.S.; Liu, Q.; Li, J. Video Snapshot Compressive Imaging Using Residual Ensemble Network. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5931–5943. [Google Scholar] [CrossRef]
- Huang, B.; Zhou, J.; Yan, X.; Jing, M.; Wan, R.; Fan, Y. CS-MCNet: A Video Compressive Sensing Reconstruction Network with Interpretable Motion Compensation. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
- Li, H.; Trocan, M.; Sawan, M.; Galayko, D. Serial Decoders-Based Auto-Encoders for Image Reconstruction. Appl. Sci. 2022, 12, 8256. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image Transformers & distillation through attention. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 10347–10357. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision Transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
- Cui, C.; Xu, L.; Yang, B.; Ke, J. Meta-TR: Meta-Attention Spatial Compressive Imaging Network with Swin Transformer. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6236–6247. [Google Scholar] [CrossRef]
- Saideni, W.; Courreges, F.; Helbert, D.; Cances, J.P. End-to-End Video Snapshot Compressive Imaging using Video Transformers. In Proceedings of the 11th International Conference on Image Processing Theory, Tools and Applications (IPTA), Salzburg, Austria, 19–22 April 2022; pp. 1–6. [Google Scholar]
- Chen, J.; Sun, Y.; Liu, Q.; Huang, R. Learning memory augmented cascading network for compressed sensing of images. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 513–529. [Google Scholar]
- Hitomi, Y.; Gu, J.; Gupta, M.; Mitsunaga, T.; Nayar, S.K. Video from a single coded exposure photograph using a learned over-complete dictionary. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 287–294. [Google Scholar]
- Reddy, D.; Veeraraghavan, A.; Chellappa, R. P2C2: Programmable pixel compressive camera for high-speed imaging. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 329–336. [Google Scholar]
- Sun, Y.; Yuan, X.; Pang, S. Compressive high-speed stereo imaging. Opt. Express 2017, 25, 18182–18190. [Google Scholar] [CrossRef] [PubMed]
- Yuan, X.; Llull, P.; Liao, X.; Yang, J.; Brady, D.J.; Sapiro, G.; Carin, L. Low-cost compressive sensing for color video and depth. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3318–3325. [Google Scholar]
- Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 davis challenge on video object segmentation. arXiv 2017, arXiv:1704.00675. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Methods (PSNR/SSIM) | Runner | Kobe | Aerial | Crash | Traffic | Drop | Average |
---|---|---|---|---|---|---|---|
GAP-TV [13] | 28.46/0.898 | 26.45/0.848 | 24.82/0.838 | 25.05/0.828 | 20.89/0.714 | 33.79/0.963 | 26.58/0.848 |
GMM-TP [11] | 29.56/0.899 | 25.13/0.788 | 26.26/0.800 | 25.71/0.868 | 25.08/0.821 | 34.34/0.964 | 27.68/0.857 |
MMLE-MFA [9] | 29.37/0.895 | 24.63/0.756 | 26.55/0.848 | 26.11/0.860 | 22.67/0.728 | 35.88/0.971 | 27.54/0.843 |
MMLE-GMM [9] | 31.98/0.927 | 27.34/0.829 | 28.01/0.882 | 26.93/0.830 | 25.69/0.825 | 40.28/0.988 | 30.04/0.880 |
PnP-FFDNet [17] | 32.88/0.988 | 30.47/0.925 | 24.02/0.814 | 24.32/0.838 | 24.08/0.833 | 40.87/0.988 | 29.44/0.889 |
E2E-CNN [18] | 33.33/0.971 | 28.59/0.761 | 27.29/0.873 | 26.29/0.874 | 23.72/0.829 | 35.97/0.971 | 29.20/0.876 |
Saideni’s [30] | 37.67/– | 31.25/– | 29.21/– | 25.40/– | 28.15/– | 35.40/– | 31.18/– |
RE2-Net [19] | 35.46/0.962 | 30.05/0.917 | 28.96/0.905 | 27.94/0.917 | 26.62/0.898 | 42.11/0.991 | 31.86/0.935 |
Ours | 38.52/0.990 | 29.11/0.881 | 28.95/0.942 | 27.47/0.931 | 28.13/0.951 | 42.35/0.996 | 32.42/0.949 |
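The Average column above is the arithmetic mean of the six per-sequence scores. As a quick sanity check on the reported numbers for our method:

```python
# Per-sequence PSNR/SSIM values of "Ours" from the comparison table
psnr = [38.52, 29.11, 28.95, 27.47, 28.13, 42.35]
ssim = [0.990, 0.881, 0.942, 0.931, 0.951, 0.996]

avg_psnr = sum(psnr) / len(psnr)   # matches the Average column within rounding (32.42)
avg_ssim = sum(ssim) / len(ssim)   # matches the Average column within rounding (0.949)
```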
Experimental Setting | Residual Measurement Mechanism | Dynamic Feature Fusion | Group Number of the Multi-Scale Structure | RWT | PSNR | SSIM |
---|---|---|---|---|---|---|
Ours | √ | √ | 3 | 3 | 32.42 | 0.949 |
Ablation 1 | √ | × | 3 | 3 | 31.92 | 0.935 |
Ablation 2 | × | √ | 3 | 3 | 31.81 | 0.936 |
Ablation 3 | × | × | 3 | 3 | 31.64 | 0.928 |
Ablation 4 | √ | √ | 2 | 3 | 31.54 | 0.930 |
Ablation 5 | √ | √ | 1 | 3 | 31.16 | 0.923 |
Ablation 6 | √ | √ | 3 | 2 | 32.29 | 0.938 |
Ablation 7 | √ | √ | 3 | 1 | 32.06 | 0.927 |
Methods | Time (s) | Frames per Second (FPS) | Parameters (×106) |
---|---|---|---|
GAP-TV | 49.67 | 0.75 | - |
GMM-TP | 184.84 | 0.20 | - |
MMLE-MFA | 930.3 | 0.04 | - |
MMLE-GMM | 3802.3 | 0.01 | - |
PnP-FFDNet | 10.32 | 3.62 | - |
E2E-CNN | 3.96 | 9.43 | 0.82 |
RE2-Net | 4.25 | 8.78 | 3.49 |
Ours | 6.73 | 4.75 | 2.16 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wen, J.; Huang, J.; Chen, X.; Huang, K.; Sun, Y. Transformer-Based Cascading Reconstruction Network for Video Snapshot Compressive Imaging. Appl. Sci. 2023, 13, 5922. https://doi.org/10.3390/app13105922