A Multi-Scale Video Longformer Network for Action Recognition
Abstract
1. Introduction
- (1) Introduction of a 3D Longformer structure and its application in constructing a Multi-Scale Video Longformer (MSVL) action recognition network.
- (2) Implementation of learnable absolute 3D position encoding and relative 3D position encoding based on depthwise separable convolution within the 3D Longformer structure (see the sketch after this list).
- (3) Creation of an assembly action dataset consisting of eight assembly actions: snipping, cutting, grinding, hammering, brushing, wrapping, turning screws, and turning wrenches.
- (4) Comprehensive experimental validation of the proposed approach on the UCF101, HMDB51, and assembly action datasets.
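As a rough illustration of contribution (2), the sketch below implements a relative 3D position encoding as a depthwise separable 3D convolution over the spatio-temporal token grid, possibly along the lines of the conditional position encodings of Chu et al. [37]. This is a minimal sketch, not the paper's implementation; the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class RelPosEnc3D(nn.Module):
    """Sketch of a relative 3D position encoding built from a depthwise
    separable 3D convolution over the token grid. Class and argument
    names are illustrative, not taken from the paper."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise conv: one filter per channel; each token mixes with its
        # spatio-temporal neighbours, so the encoding depends on relative
        # (not absolute) positions and adapts to any grid size.
        self.depthwise = nn.Conv3d(dim, dim, kernel_size,
                                   padding=pad, groups=dim)
        # Pointwise 1x1x1 conv mixes information across channels.
        self.pointwise = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor, T: int, H: int, W: int) -> torch.Tensor:
        # x: (B, T*H*W, C) token sequence -> (B, C, T, H, W) grid
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, T, H, W)
        pe = self.pointwise(self.depthwise(grid))
        # Add the convolutional encoding back as a residual, then flatten.
        return x + pe.flatten(2).transpose(1, 2)
```

The depthwise stage injects position information through its local receptive field while the pointwise stage mixes channels; unlike a fixed learnable table, such a convolutional encoding generalizes to token grids of other sizes at inference.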
2. Related Works
3. Method
3.1. Overall Network Architecture
3.2. PatchEmbed
3.3. 3D Longformer AttenBlock and MLPBlock
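Since this section centers on Longformer-style attention over video tokens, here is a minimal sketch of the attention pattern: each video token attends only within a local 3D window, while a small set of global tokens attends to, and is attended by, every token. The function name and arguments are assumptions, not the paper's code.

```python
import torch

def longformer_3d_mask(T, H, W, wt, wh, ww, n_global):
    """Boolean attention mask for Longformer-style 3D attention: each of
    the T*H*W video tokens attends only to tokens inside a (wt, wh, ww)
    window centred on it, while n_global global tokens attend to, and are
    attended by, every token. True = attention allowed."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W),
        indexing="ij"), dim=-1).reshape(-1, 3)            # (N, 3)
    d = (coords[:, None, :] - coords[None, :, :]).abs()   # pairwise |offsets|
    local = ((d[..., 0] <= wt // 2) &
             (d[..., 1] <= wh // 2) &
             (d[..., 2] <= ww // 2))                      # (N, N)
    N = T * H * W
    mask = torch.zeros(n_global + N, n_global + N, dtype=torch.bool)
    mask[:n_global, :] = True   # global tokens see everything
    mask[:, :n_global] = True   # everything sees the global tokens
    mask[n_global:, n_global:] = local
    return mask
```

Such a mask can be passed as the `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention`; note that a dense mask is quadratic in token count, so an efficient implementation would use the banded computation of Longformer [18] rather than this illustration.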
4. Experiments
4.1. Dataset
4.1.1. Assembly Action Dataset
4.1.2. UCF101 Dataset
4.1.3. HMDB51 Dataset
4.2. Experimental Environment
4.3. Evaluation Metrics
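The result tables in this paper report Top-1 and Top-5 accuracy. For concreteness, a minimal sketch of the metric (the function name is ours, not the paper's): a clip counts as correct under Top-k if the true class is among the k highest-scoring predictions.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Top-k accuracy in percent. logits: (B, num_classes); labels: (B,)."""
    _, pred = logits.topk(max(ks), dim=1)           # (B, max_k) class indices
    correct = pred.eq(labels.unsqueeze(1))          # (B, max_k) hit matrix
    return {k: correct[:, :k].any(dim=1).float().mean().item() * 100
            for k in ks}
```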
4.4. Hyperparameter Tuning
4.4.1. Position Encoding
4.4.2. Number of 3D Longformer AttenBlocks and MLPBlocks
4.4.3. Passing Global Tokens to the Next Stage
4.4.4. Training on the Assembly Action Dataset
4.5. Comparative Experiments
4.6. Visualization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. The Confusion Matrix of MSVL on the UCF101 Dataset
References
- Yang, S.; Zhao, Y.; Ma, Y. Analysis of the reasons and development of short video application-Taking Tik Tok as an example. In Proceedings of the 2019 9th International Conference on Information and Social Science (ICISS 2019), Manila, Philippines, 12–14 July 2019; pp. 12–14.
- Xiao, X.; Xu, D.; Wan, W. Overview: Video recognition from handcrafted method to deep learning method. In Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016; pp. 646–651.
- Lavee, G.; Rivlin, E.; Rudzsky, M. Understanding video events: A survey of methods for automatic interpretation of semantic occurrences in video. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 39, 489–504.
- Dalal, N.; Triggs, B.; Schmid, C. Human detection using oriented histograms of flow and appearance. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 428–441.
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
- O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Al-Berry, M.; Salem, M.M.; Hussein, A.; Tolba, M. Spatio-temporal motion detection for intelligent surveillance applications. Int. J. Comput. Methods 2015, 12, 1350097.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568.
- Huang, D.A.; Ramanathan, V.; Mahajan, D.; Torresani, L.; Paluri, M.; Fei-Fei, L.; Niebles, J.C. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7366–7375.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
- Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6.
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 2021 International Conference on Machine Learning, Online, 18–24 July 2021; Volume 2, p. 4.
- Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6824–6835.
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 20–36.
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
- Sun, C.; Shrivastava, A.; Vondrick, C.; Murphy, K.; Sukthankar, R.; Schmid, C. Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 318–334.
- Wu, C.Y.; Feichtenhofer, C.; Fan, H.; He, K.; Krahenbuhl, P.; Girshick, R. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 284–293.
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
- Girdhar, R.; Carreira, J.; Doersch, C.; Zisserman, A. Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 244–253.
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846.
- Zhang, Y.; Li, X.; Liu, C.; Shuai, B.; Zhu, Y.; Brattoli, B.; Chen, H.; Marsic, I.; Tighe, J. VidTr: Video transformer without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13577–13587.
- Liu, X.; Wang, Q.; Hu, Y.; Tang, X.; Zhang, S.; Bai, S.; Bai, X. End-to-end temporal action detection with transformer. IEEE Trans. Image Process. 2022, 31, 5427–5441.
- Weng, Y.; Pan, Z.; Han, M.; Chang, X.; Zhuang, B. An efficient spatio-temporal pyramid transformer for action detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 358–375.
- Wu, C.Y.; Li, Y.; Mangalam, K.; Fan, H.; Xiong, B.; Malik, J.; Feichtenhofer, C. MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13587–13597.
- Wei, C.; Fan, H.; Xie, S.; Wu, C.Y.; Yuille, A.; Feichtenhofer, C. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14668–14678.
- Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211.
- Li, K.; Wang, Y.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv 2022, arXiv:2201.04676.
- Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3163–3172.
- Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-scale vision Longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2998–3008.
- Chu, X.; Zhang, B.; Tian, Z.; Wei, X.; Xia, H. Do we really need explicit position encodings for vision transformers? arXiv 2021, arXiv:2102.10882.
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Washington, DC, USA, 6–13 November 2011; pp. 2556–2563.
- Fan, H.; Li, Y.; Xiong, B.; Lo, W.Y.; Feichtenhofer, C. PySlowFast. 2020. Available online: https://github.com/facebookresearch/slowfast (accessed on 23 January 2024).
- Li, C.; Yang, J.; Zhang, P.; Gao, M.; Xiao, B.; Dai, X.; Yuan, L.; Gao, J. Efficient self-supervised vision transformers for representation learning. arXiv 2021, arXiv:2106.09785.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
- MMAction2 Contributors. OpenMMLab’s next generation video understanding toolbox and benchmark. 2020. Available online: https://github.com/open-mmlab/mmaction2 (accessed on 23 January 2024).
- Lin, J.; Gan, C.; Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
Name of Parameter | Stage1 | Stage2 | Stage3 | Stage4 |
---|---|---|---|---|
n | 1 | 2 | 8 | 1 |
Patch/stride size | 4 | 2 | 2 | 2 |
Embedding dimension | 96 | 192 | 384 | 768 |
Number of heads | 3 | 3 | 6 | 12 |
Head dimension | 32 | 64 | 64 | 64 |
T | 8 | 8 | 8 | 8 |
Temporal feature size | 8 | 8 | 8 | 8 |
Spatial feature map size | 56 × 56 | 28 × 28 | 14 × 14 | 7 × 7 |
Attention window size | 15 × 15 | 15 × 15 | 15 × 15 | 15 × 15 |
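For readability, the four-stage configuration in the table above can be restated as a plain Python structure. The field names below are illustrative assumptions, not identifiers from the paper's code; only the values come from the table.

```python
# Illustrative restatement of the four-stage MSVL configuration.
# Head dimension is derivable as dim // heads (96/3 = 32, 192/3 = 64, ...).
msvl_stages = [
    # n blocks, patch/stride, embed dim, heads, spatial map, attn window
    dict(n=1, stride=4, dim=96,  heads=3,  fmap=(56, 56), window=(15, 15)),
    dict(n=2, stride=2, dim=192, heads=3,  fmap=(28, 28), window=(15, 15)),
    dict(n=8, stride=2, dim=384, heads=6,  fmap=(14, 14), window=(15, 15)),
    dict(n=1, stride=2, dim=768, heads=12, fmap=(7, 7),   window=(15, 15)),
]
T = 8  # clip length in frames, constant across stages
```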
Label | Acc./% | Label | Acc./% | Label | Acc./% | Label | Acc./% |
---|---|---|---|---|---|---|---|
ApplyEyeMakeup | 99.07 | Drumming | 99.50 | MilitaryParade | 97.37 | Shotput | 95.73 |
ApplyLipstick | 100.00 | Fencing | 99.32 | Mixing | 100.00 | SkateBoarding | 81.03 |
Archery | 96.95 | FieldHockeyPenalty | 93.89 | MoppingFloor | 87.50 | Skiing | 91.45 |
BabyCrawling | 89.10 | FloorGymnastics | 98.65 | Nunchucks | 100.00 | Skijet | 94.23 |
BalanceBeam | 94.57 | FrisbeeCatch | 100.00 | ParallelBars | 94.35 | SkyDiving | 100.00 |
BandMarching | 98.33 | FrontCrawl | 93.33 | PizzaTossing | 95.16 | SoccerJuggling | 92.26 |
BaseballPitch | 97.06 | GolfSwing | 95.83 | PlayingCello | 100.00 | SoccerPenalty | 100.00 |
Basketball | 85.62 | Haircut | 85.47 | PlayingDaf | 97.87 | StillRings | 88.89 |
BasketballDunk | 100.00 | Hammering | 95.83 | PlayingDhol | 100.00 | SumoWrestling | 100.00 |
BenchPress | 100.00 | HammerThrow | 100.00 | PlayingFlute | 98.37 | Surfing | 100.00 |
Biking | 89.06 | HandstandPushups | 98.91 | PlayingGuitar | 100.00 | Swing | 100.00 |
Billiards | 100.00 | HandstandWalking | 93.27 | PlayingPiano | 100.00 | TableTennisShot | 100.00 |
BlowDryHair | 95.14 | HeadMassage | 97.37 | PlayingSitar | 100.00 | TaiChi | 98.61 |
BlowingCandles | 90.28 | HighJump | 90.15 | PlayingTabla | 100.00 | TennisSwing | 100.00 |
BodyWeightSquats | 100.00 | HorseRace | 98.44 | PlayingViolin | 99.31 | ThrowDiscus | 84.87 |
Bowling | 99.42 | HorseRiding | 91.00 | PoleVault | 82.84 | TrampolineJumping | 100.00 |
BoxingPunchingBag | 100.00 | HulaHoop | 100.00 | PommelHorse | 97.97 | Typing | 100.00 |
BoxingSpeedBag | 100.00 | IceDancing | 99.48 | PullUps | 98.96 | UnevenBars | 95.59 |
BreastStroke | 92.31 | JavelinThrow | 92.86 | Punch | 98.08 | VolleyballSpiking | 85.16 |
BrushingTeeth | 96.59 | JugglingBalls | 100.00 | PushUps | 100.00 | WalkingWithDog | 80.62 |
CleanAndJerk | 100.00 | JumpingJack | 100.00 | Rafting | 100.00 | WallPushups | 100.00 |
CliffDiving | 96.28 | JumpRope | 100.00 | RockClimbingIndoor | 96.11 | WritingOnBoard | 100.00 |
CricketBowling | 96.81 | Kayaking | 93.29 | RopeClimbing | 95.73 | YoYo | 96.62 |
CricketShot | 100.00 | Knitting | 99.32 | Rowing | 98.26 | ||
CuttingInKitchen | 98.53 | LongJump | 99.36 | SalsaSpin | 100.00 | ||
Diving | 98.03 | Lunges | 87.80 | ShavingBeard | 100.00 |
Type | Top-1/% | Top-5/% | FLOPs/G | Params/M |
---|---|---|---|---|
APE | 68.62 | 86.95 | 157 | 27.3 |
RPE | 71.38 | 88.92 | 158 | 27.3 |
n | Top-1/% | Top-5/% | FLOPs/G | Params/M |
---|---|---|---|---|
(1, 1, 9, 1) | 69.41 | 88.37 | 156 | 28.5 |
(1, 2, 8, 1) | 71.38 | 88.92 | 158 | 27.3 |
Passing | Top-1/% | Top-5/% | FLOPs/G | Params/M |
---|---|---|---|---|
Yes | 72.94 | 89.42 | 158 | 27.3 |
No | 71.38 | 88.92 | 158 | 27.3 |
Method | HMDB51 Top-1/% | HMDB51 Top-5/% | UCF101 Top-1/% | UCF101 Top-5/% | FLOPs/G | Params/M |
---|---|---|---|---|---|---|
C3D [11] | 51.6 | 77.4 | 82.3 | 95.1 | 38.5 | 78.4 |
TSN [19] | 56.1 | 84.2 | 83.1 | 96.9 | 102.7 | 24.3 |
I3D [12] | 66.4 | 86.5 | 91.8 | 98.2 | 59.3 | 35.4 |
R(2 + 1)D [20] | 66.6 | 86.8 | 93.6 | 99.2 | 53 | 63.8 |
TSM [44] | 69.5 | 88.2 | 92.6 | 99.1 | 32.8 | 23.8 |
divST [16] | 72.7 | 93.6 | 94.7 | 99.7 | 196 | 122.2 |
Swin-S [33] | 72.2 | 93.4 | 97.6 | 99.7 | 166 | 49.8 |
MSVL (Ours) | 72.9 | 89.4 | 97.6 | 99.8 | 158 | 27.3 |