Joint-Based Action Progress Prediction
Abstract
1. Introduction
- We present a joint-based action progress prediction model. The architecture is built on a recurrent network that estimates the progress of the observed action as it unfolds, emitting a prediction online for every frame. To the best of our knowledge, we are the first to adopt a joint-based approach for action progress prediction.
- We augment the progress prediction model with additional modules that estimate body joints and the category of the ongoing action. This lets us estimate progress directly from raw RGB pixels while reasoning on joint positions (a sketch of the resulting pipeline follows this list).
- The proposed progress prediction model is highly efficient and suitable for real-time online settings. We analyze its execution cost under different scenarios, depending on the degree of data availability.
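To make the overall design concrete, here is a minimal sketch of the progress prediction head, assuming a PyTorch-style implementation. The class name, input sizes (13 body joints with (x, y) coordinates, 15 action classes), and hidden size are our illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a GRU-based progress prediction head (illustrative,
# not the authors' code). Input: per-frame joint coordinates from the pose
# module concatenated with the classifier's one-hot action category.
import torch
import torch.nn as nn

class ProgressPredictor(nn.Module):
    """Regresses per-frame action progress in [0, 1] from body joints."""

    def __init__(self, num_joints=13, num_classes=15, hidden=256):
        super().__init__()
        self.gru = nn.GRU(num_joints * 2 + num_classes, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x, state=None):
        # x: (B, T, num_joints * 2 + num_classes)
        out, state = self.gru(x, state)            # state carries memory between calls
        return self.head(out).squeeze(-1), state   # (B, T) progress, updated state
```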
2. Related Work
3. Joint-Based Action Progress Prediction
3.1. Action Progress Prediction
- Many-to-One: we consider a window of the N most recent frames, which are fed sequentially to the network (both regimes are sketched in the code after this list). Once all N frames have been processed, a single output is produced: the progress prediction for the last frame of the input sequence. When a new frame arrives, the window slides so that it again covers the N most recent frames.
- Many-to-Many: each frame of the sequence is fed to the network in real time, producing the progress prediction for that frame. The network maintains an internal state that is updated throughout the whole sequence, so that previous frames influence the prediction for the current frame.
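Both regimes can be served by the same recurrent model; only the handling of the hidden state differs. Below is a hedged sketch reusing the hypothetical ProgressPredictor from the Introduction (window size N, feature layout, and helper names are our illustrative choices):

```python
import torch

def many_to_one(model, feats, N=16):
    """Re-run the N most recent frames at every step and keep only the
    output for the newest frame; the hidden state is rebuilt each call."""
    preds = []
    for t in range(feats.size(0)):
        window = feats[max(0, t - N + 1): t + 1].unsqueeze(0)  # (1, <=N, D)
        out, _ = model(window)
        preds.append(out[0, -1].item())
    return preds

def many_to_many(model, feats):
    """Feed one frame at a time and carry the hidden state forward,
    so each new frame costs a single recurrent step."""
    preds, state = [], None
    for t in range(feats.size(0)):
        out, state = model(feats[t:t + 1].unsqueeze(0), state)  # (1, 1, D)
        preds.append(out[0, 0].item())
    return preds

# Usage on dummy features: 120 frames, 13 (x, y) joints + 15-class one-hot.
model = ProgressPredictor().eval()
feats = torch.randn(120, 13 * 2 + 15)
with torch.no_grad():
    p_window, p_online = many_to_one(model, feats), many_to_many(model, feats)
```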
3.2. Joint Extraction—UniPose
3.3. Classification Module
4. Dataset
5. Experiments
5.1. Oracle Model
5.2. Progress Prediction with Body Joints Estimation
5.3. Progress Prediction with Action Classifier
5.4. Results
5.5. Cyclic Actions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mabrouk, A.B.; Zagrouba, E. Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Syst. Appl. 2018, 91, 480–491.
- Han, Y.; Zhang, P.; Zhuo, T.; Huang, W.; Zhang, Y. Going deeper with two-stream ConvNets for action recognition in video surveillance. Pattern Recognit. Lett. 2018, 107, 83–90.
- Le, Q.V.; Zou, W.Y.; Yeung, S.Y.; Ng, A.Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3361–3368.
- Turchini, F.; Seidenari, L.; Del Bimbo, A. Understanding and localizing activities from correspondences of clustered trajectories. Comput. Vis. Image Underst. 2017, 159, 128–142.
- Yuan, H.; Ni, D.; Wang, M. Spatio-temporal dynamic inference network for group activity recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7476–7485.
- Furnari, A.; Farinella, G.M. Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4021–4036.
- Osman, N.; Camporese, G.; Coscia, P.; Ballan, L. SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3437–3445.
- Manganaro, F.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R. Hand gestures for the human-car interaction: The Briareo dataset. In Proceedings of the International Conference on Image Analysis and Processing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 560–571.
- Furnari, A.; Farinella, G.M. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6252–6261.
- Innocenti, S.U.; Becattini, F.; Pernici, F.; Del Bimbo, A. Temporal binary representation for event-based action recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 10426–10432.
- Yang, P.; Mettes, P.; Snoek, C.G. Few-Shot Transformation of Common Actions into Time and Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16031–16040.
- Sevilla-Lara, L.; Liao, Y.; Güney, F.; Jampani, V.; Geiger, A.; Black, M.J. On the integration of optical flow and action recognition. In Proceedings of the German Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; pp. 281–297.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576.
- Borghi, G.; Vezzani, R.; Cucchiara, R. Fast gesture recognition with multiple stream discrete HMMs on 3D skeletons. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 997–1002.
- D’Eusanio, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R. RefiNet: 3D human pose refinement with depth maps. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2320–2327.
- Ferrari, C.; Casini, L.; Berretti, S.; Del Bimbo, A. Monocular 3D Body Shape Reconstruction under Clothing. J. Imaging 2021, 7, 257.
- Li, B.; Li, X.; Zhang, Z.; Wu, F. Spatio-temporal graph routing for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8561–8568.
- Freire-Obregón, D.; Castrillón-Santana, M.; Barra, P.; Bisogni, C.; Nappi, M. An attention recurrent model for human cooperation detection. Comput. Vis. Image Underst. 2020, 197, 102991.
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
- Artacho, B.; Savakis, A.E. UniPose: Unified Human Pose Estimation in Single Images and Videos. arXiv 2020, arXiv:2001.08095. Available online: https://arxiv.org/abs/2001.08095 (accessed on 29 December 2022).
- Shou, Z.; Wang, D.; Chang, S.F. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1049–1058.
- Chao, Y.W.; Vijayanarasimhan, S.; Seybold, B.; Ross, D.A.; Deng, J.; Sukthankar, R. Rethinking the Faster R-CNN architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1130–1139.
- Jain, M.; Van Gemert, J.; Jégou, H.; Bouthemy, P.; Snoek, C.G. Action localization with tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 740–747.
- Singh, G.; Saha, S.; Sapienza, M.; Torr, P.H.; Cuzzolin, F. Online real-time multiple spatiotemporal action localisation and prediction. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3637–3646.
- Dwibedi, D.; Aytar, Y.; Tompson, J.; Sermanet, P.; Zisserman, A. Temporal Cycle-Consistency Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Becattini, F.; Uricchio, T.; Seidenari, L.; Ballan, L.; Del Bimbo, A. Am I done? Predicting action progress in videos. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2020, 16, 1–24.
- Twinanda, A.P.; Yengera, G.; Mutter, D.; Marescaux, J.; Padoy, N. RSDNet: Learning to predict remaining surgery duration from laparoscopic videos without manual annotations. IEEE Trans. Med. Imaging 2018, 38, 1069–1078.
- Wang, L.; Huynh, D.Q.; Koniusz, P. A comparative review of recent Kinect-based action recognition algorithms. IEEE Trans. Image Process. 2019, 29, 15–28.
- Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978.
- Kalogeiton, V.; Weinzaepfel, P.; Ferrari, V.; Schmid, C. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4405–4413.
- Camporese, G.; Coscia, P.; Furnari, A.; Farinella, G.M.; Ballan, L. Knowledge distillation for action anticipation via label smoothing. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3312–3319.
- Guo, G.; Lai, A. A survey on still image based human action recognition. Pattern Recognit. 2014, 47, 3343–3361.
- Sadanand, S.; Corso, J.J. Action bank: A high-level representation of activity in video. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1234–1241.
- Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2914–2923.
- Lin, T.; Zhao, X.; Shou, Z. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 988–996.
- Saha, S.; Singh, G.; Sapienza, M.; Torr, P.H.S.; Cuzzolin, F. Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, 19–22 September 2016; Wilson, R.C., Hancock, E.R., Smith, W.A.P., Eds.; BMVA Press: York, UK, 2016.
- Mettes, P.; van Gemert, J.C.; Snoek, C.G. Spot on: Action localization from pointly-supervised proposals. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 437–453.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
- Patra, A.; Noble, J. Sequential anatomy localization in fetal echocardiography videos. arXiv 2018, arXiv:1810.11868.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Cao, Z.; Hidalgo Martinez, G.; Simon, T.; Wei, S.; Sheikh, Y.A. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Trans. Pattern Anal. Mach. Intell. 2022.
- Artacho, B.; Savakis, A. Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation. Sensors 2019, 19, 5361.
- Siriborvornratanakul, T. Human behavior in image-based Road Health Inspection Systems despite the emerging AutoML. J. Big Data 2022, 9, 1–17.
- Karmaker, S.K.; Hassan, M.M.; Smith, M.J.; Xu, L.; Zhai, C.; Veeramachaneni, K. AutoML to date and beyond: Challenges and opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–36.
- Dou, W.; Liu, Y.; Liu, Z.; Yerezhepov, D.; Kozhamkulov, U.; Akilzhanova, A.; Dib, O.; Chan, C.K. An AutoML Approach for Predicting Risk of Progression to Active Tuberculosis based on Its Association with Host Genetic Variations. In Proceedings of the 2021 10th International Conference on Bioinformatics and Biomedical Science, Xiamen, China, 29–31 October 2021; pp. 82–88.
- Silva, M.O.; Valadão, M.D.; Cavalcante, V.L.; Santos, A.V.; Torres, G.M.; Mattos, E.V.; Pereira, A.M.; Uchôa, M.S.; Torres, L.M.; Linhares, J.E.; et al. Action Recognition of Industrial Workers using Detectron2 and AutoML Algorithms. In Proceedings of the 2022 IEEE International Conference on Consumer Electronics-Taiwan, Taipei, Taiwan, 6–8 July 2022; pp. 321–322.
- Jain, L.C.; Medsker, L.R. Recurrent Neural Networks: Design and Applications, 1st ed.; CRC Press, Inc.: Boca Raton, FL, USA, 1999.
- Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv 2014, arXiv:1409.1259. Available online: https://arxiv.org/abs/1409.1259 (accessed on 29 December 2022).
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. Available online: https://arxiv.org/abs/1405.0312 (accessed on 29 December 2022).
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Zhang, W.; Zhu, M.; Derpanis, K.G. From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2248–2255.
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
| Paradigm | Class Data Available | #GRU Stages | MAE |
|---|---|---|---|
| Many-to-One | ✗ | 1 | |
| Many-to-One | ✓ | 1 | |
| Many-to-One | ✓ | 2 | |
| Many-to-Many | ✓ | 1 | |
| Many-to-Many | ✓ | 2 | |
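The MAE columns in these tables report the mean absolute error between predicted and ground-truth progress, expressed as a percentage of the action's duration. A plausible formalization (an assumption on our part, since the metric is not spelled out in this excerpt) is

$$\mathrm{MAE} = \frac{100}{T} \sum_{t=1}^{T} \left| \hat{p}_t - \frac{t}{T} \right|,$$

where $\hat{p}_t \in [0, 1]$ is the progress predicted at frame $t$ of a $T$-frame action and $t/T$ is the ground-truth (linear) progress. Under this definition, a constant 50% prediction has an expected MAE of 25% for uniformly distributed targets, consistent with the baseline in the comparison table below.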
| Device | Computing Unit | Total Execution Time | Average Throughput |
|---|---|---|---|
| CPU | Intel Xeon CPU @ 2.20 GHz | 69.24 s | 1044.86 FPS |
| GPU | NVIDIA Tesla T4 | 43.77 s | 1652.82 FPS |
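Figures of this kind can be reproduced in spirit with a simple wall-clock benchmark of many-to-many inference. A sketch under the same assumptions as the earlier snippets (device, feature dimensions, and frame count are placeholders):

```python
import time
import torch

def benchmark(model, feats, device):
    """Return total execution time (s) and average throughput (FPS) for
    many-to-many inference over a precomputed feature sequence."""
    model, feats = model.to(device).eval(), feats.to(device)
    state = None
    start = time.perf_counter()
    with torch.no_grad():
        for t in range(feats.size(0)):
            _, state = model(feats[t:t + 1].unsqueeze(0), state)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed, feats.size(0) / elapsed

# e.g., benchmark(ProgressPredictor(), torch.randn(10_000, 41), torch.device("cpu"))
# where 41 = 13 joints * 2 coordinates + 15 classes, as in the earlier sketches.
```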
| Joint Source | UniPose Finetuning | Progress Finetuning | MAE |
|---|---|---|---|
| UniPose | ✗ | ✗ | 19.82 |
| UniPose | ✓ | ✗ | 11.51 |
| UniPose | ✓ | ✓ | 7.90 |
| Oracle | ✗ | ✗ | 6.22 |
| Backbone | Finetuning | GRU Layer | Test Accuracy (%) |
|---|---|---|---|
| VGG16 | ✓ | ✗ | 70.94 |
| InceptionV3 | ✗ | ✗ | 72.21 |
| InceptionV3 | ✓ | ✗ | 77.23 |
| InceptionV3 | ✓ | ✓ | 83.04 |
| | 50% Constant Prediction | Joint-Based Progress Prediction | | |
|---|---|---|---|---|
| Joint source | - | Oracle | UniPose | UniPose |
| Class source | - | Oracle | Oracle | Classifier |
| MAE | 25.00% | 6.22% | 7.90% | 10.94% |