Self-Supervised Learning to Detect Key Frames in Videos
Abstract
1. Introduction
2. Related Work
2.1. Video Annotation Methods
2.2. Key Frame Detection
2.3. Problem Formulation
2.4. Frame-Level Video Labelling
3. Proposed Approach
3.1. Learning a Deep Model for Automatic Key Frame Detection
3.2. Implementation Details
4. Experiments
4.1. Datasets
Evaluation Criteria
4.2. Results
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
Average Error in Key Frame Detection

Class Name | Number Detected | Key Frame Position
---|---|---
Baseball Pitch | ±1.73 | ±0.455
Basket. Dunk | ±1.64 | ±0.640
Billiards | ±2.18 | ±1.300
Clean and Jerk | ±2.27 | ±1.202
Cliff Diving | ±2.45 | ±0.627
Cricket Bowl. | ±2.45 | ±1.140
Cricket Shot | ±1.27 | ±0.828
Diving | ±2.00 | ±0.907
Frisbee Catch | ±1.73 | ±0.546
Golf Swing | ±2.45 | ±0.752
Hamm. Throw | ±1.73 | ±1.223
High Jump | ±1.73 | ±0.434
Javelin Throw | ±2.45 | ±0.555
Long Jump | ±2.27 | ±0.611
Pole Vault | ±2.64 | ±1.139
Shotput | ±2.00 | ±0.564
Soccer Penalty | ±2.09 | ±0.712
Tennis Swing | ±1.64 | ±0.554
Throw Discus | ±2.09 | ±0.642
Volley. Spike | ±1.54 | ±0.633
Average error | ±2.02 | ±0.773
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yan, X.; Gilani, S.Z.; Feng, M.; Zhang, L.; Qin, H.; Mian, A. Self-Supervised Learning to Detect Key Frames in Videos. Sensors 2020, 20, 6941. https://doi.org/10.3390/s20236941