Deep Learning Innovations in Video Classification: A Survey on Techniques and Dataset Evaluations
Abstract
1. Introduction
- Comprehensive overview of CNN-based models: We summarize the state-of-the-art CNN models used for image analysis and their applications in video classification, highlighting key architectures such as LeNet-5, AlexNet, VGG-16, VGG-19, Inception, GoogLeNet, ResNet, SqueezeNet, ENet, ShuffleNet, and DenseNet. Each model’s features, evaluations, and problem-solving capabilities are detailed to provide a foundational understanding of video classification tasks.
- Deep learning approaches for video classification: We cover the integration of CNNs and recurrent neural networks (RNNs) for video classification: CNNs capture spatial features within individual video frames, while RNNs model temporal dependencies across frames, making the combination effective for video understanding. We also examine the use of transformer models in conjunction with CNNs to enhance spatial and temporal feature modeling (a minimal CNN–LSTM sketch follows this list).
- Uni-modal and multi-modal fusion frameworks: We compare uni-modal approaches, which utilize a single data modality, with multi-modal approaches that integrate various modalities (text, audio, visual, sensor, and more). Multi-modal methods improve classification accuracy by leveraging the complementary strengths of different data types.
- Feature extraction and representation techniques: Effective feature extraction is critical for video classification. We review techniques such as Scale-Invariant Feature Transform (SIFT) and data augmentation methods like random rotation and shift, which have been shown to improve classification accuracy in video datasets.
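To make the CNN–RNN pairing above concrete, the following minimal PyTorch sketch applies a small 2D CNN to every frame and an LSTM over the resulting feature sequence. It is an illustrative toy model, not an architecture from any surveyed work; the layer sizes, clip length, and 101-class output (chosen to mirror UCF-101) are assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Toy CNN + LSTM video classifier: a 2D CNN extracts per-frame
    spatial features, and an LSTM models temporal dependencies across frames."""
    def __init__(self, num_classes=101, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame spatial encoder
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # -> (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                           # clips: (B, T, 3, H, W)
        b, t, c, h, w = clips.shape
        frames = clips.view(b * t, c, h, w)             # fold time into the batch
        feats = self.cnn(frames).flatten(1)             # (B*T, 64)
        feats = self.proj(feats).view(b, t, -1)         # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(feats)                  # last hidden state summarizes the clip
        return self.head(h_n[-1])                       # (B, num_classes)

if __name__ == "__main__":
    model = CNNLSTMClassifier(num_classes=101)
    dummy = torch.randn(2, 16, 3, 112, 112)             # 2 clips of 16 frames each
    print(model(dummy).shape)                           # torch.Size([2, 101])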
2. Background Studies on Video Classification
2.1. Relevant Surveys
2.2. Evolution of CNNs in Image Processing
2.3. Fundamental Techniques in CNN-Based Image Processing
3. Video Classification
3.1. Fundamental Deep Learning Architecture for Video Classification
3.2. Parallel Processing in Video Classification
3.3. The Methods Used in Video Classification
3.4. Hybrid Models for Video Classification
Hybrid Model | Strengths | Weaknesses |
---|---|---|
CNN-RNN [43] | Combines spatial features with temporal dynamics; specialized CNN for emotional features. | Complexity in training; high computational resources. |
Two-stream architecture [44] | Effectively integrates appearance and motion; separate processing for spatial and temporal data. | Pixel-wise correspondences between spatial and temporal features need learning; constraints on temporal scale. |
Hybrid network (RNN + 3D CNN) [45] | Late-fusion of RNN and C3D; encodes appearance and motion separately. | Increased model complexity; potential for overfitting with limited data. |
HAM [46] | Dynamically attends to relevant spatial features and improves discriminative power. | Requires significant tuning and potentially high computational cost. |
Hybrid CNN with channel attention mechanism [47] | Captures intricate spatio-temporal characteristics; dynamic focus on relevant features. | High complexity; requires substantial computational resources. |
CNN + vision transformer [48] | Long-term temporal relationship modeling; effective for anomaly detection in surveillance. | High computational and memory requirements; complex architecture. |
3DCLSTM + CNN-RNN [49] | Effective multi-modal fusion; captures spatio-temporal and textual emotional cues. | Very high complexity; demanding computational resources. |
Ensemble models [50] | Enhanced image quality; accurate vehicle classification. | Requires extensive pre-processing; high computational cost. |
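The CNN + vision transformer entry above [48] pairs a convolutional backbone with transformer-based temporal modeling. The sketch below is an illustrative stand-in for that family of hybrids, not the TransCNN architecture itself; the ResNet-18 backbone, all layer sizes, and the 101-class output are assumptions, and a recent torchvision release is assumed.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNTransformerClassifier(nn.Module):
    """Illustrative hybrid: a 2D CNN encodes each frame, and a transformer
    encoder attends over the frame sequence for long-range temporal context."""
    def __init__(self, num_classes=101, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        backbone = resnet18()                           # randomly initialized here; swap in
        backbone.fc = nn.Identity()                     # pretrained weights for transfer learning
        self.backbone = backbone                        # outputs 512-d pooled features per frame
        self.proj = nn.Linear(512, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clips):                           # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))      # (B*T, 512)
        feats = self.proj(feats).view(b, t, -1)         # (B, T, d_model)
        ctx = self.temporal(feats)                      # self-attention over time
        return self.head(ctx.mean(dim=1))               # average-pool the sequence

if __name__ == "__main__":
    x = torch.randn(2, 8, 3, 112, 112)
    print(CNNTransformerClassifier()(x).shape)          # torch.Size([2, 101])
```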
3.5. Challenges and Future Directions of Video Classification
3.5.1. Challenges in Video Classification
3.5.2. Future Directions of Video Classification
3.6. Overview of Deep Learning Frameworks and Hybrid Models for Video Classification
3.6.1. Data Augmentation
3.6.2. Pre-Training on Hybrid Models
3.6.3. Hybrid Approach to Training Models on Video Classification
4. Evaluation Metrics and Comparison of Existing State-of-the-Art Video Classification Tasks
4.1. Performance Metrics for Evaluation in Video Classification
4.2. Comparison of Datasets for Video Classification
4.3. Comparison of Some Existing Approaches on the UCF-101 Dataset
5. Discussion
5.1. Single and Hybrid Models
5.2. Data Augmentation and Pre-Training
5.3. Challenges in Video Classification
- Handling large-scale datasets: Efficiently processing and training on large-scale video datasets remains a significant challenge. The computational cost and time required are substantial, necessitating the development of more efficient algorithms and hardware optimizations.
- Generalization capabilities: It is crucial to ensure that models generalize well across diverse video datasets and real-world scenarios. Current models often struggle with overfitting to specific datasets, limiting their applicability.
- Temporal consistency and long-term dependencies: Capturing long-term temporal dependencies and maintaining temporal consistency across frames is another critical area where existing models can improve. While hybrid models have made strides in this direction, there is still room for enhancing effectiveness.
- Reporting and comparing performance metrics: The inconsistent reporting of performance metrics, such as the omission of standard deviation values, poses a significant challenge. Standard deviations are essential for assessing the variability and reliability of reported accuracies, so future research should include these statistics to enable more robust comparisons of algorithm performance (a minimal reporting example follows this list).
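As a small illustration of the reporting practice advocated above, the snippet below aggregates per-run accuracies into a mean ± sample standard deviation, e.g., across cross-validation folds or random seeds. The accuracy values are invented for demonstration.

```python
import statistics

def summarize_accuracy(per_run_acc):
    """Report mean ± sample standard deviation over repeated runs/folds."""
    mean = statistics.mean(per_run_acc)
    std = statistics.stdev(per_run_acc)     # sample std; requires at least two runs
    return f"{mean:.2f} ± {std:.2f}"

# Hypothetical accuracies from five training runs of the same model.
print(summarize_accuracy([91.3, 90.8, 92.1, 91.0, 91.6]))   # "91.36 ± 0.51"
```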
5.4. Future Directions
- Efficient model architectures: Developing more efficient model architectures that can handle large-scale datasets without compromising performance is essential. This includes exploring novel neural network designs and leveraging advancements in hardware acceleration.
- Advanced data augmentation techniques: Incorporating more sophisticated data augmentation techniques to simulate real-world variations will improve model robustness and generalization (a minimal augmentation sketch follows this list).
- Integration of multimodal data: Utilizing multimodal data, such as combining video with audio and text, can provide a more comprehensive understanding of the content, leading to better classification performance.
- Improved temporal modeling: Enhancing temporal modeling capabilities will be crucial, particularly for long-term dependencies. This will involve developing new types of recurrent units or attention mechanisms explicitly tailored to video data.
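To make the augmentation direction above concrete (including the random rotation and shift mentioned in the introduction), here is a minimal per-frame augmentation sketch using torchvision transforms. The specific transforms, parameter ranges, and the choice to transform frames independently are illustrative assumptions, and a recent torchvision release (tensor-capable transforms) is assumed.

```python
import torch
from torchvision import transforms

# Illustrative per-frame augmentation pipeline: random rotation, random shift
# (translation), horizontal flip, and mild color jitter.
frame_augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # random shift
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def augment_clip(clip):
    """Augment a (T, 3, H, W) clip frame by frame. Note: for strict temporal
    consistency the sampled parameters should be shared across frames; here each
    frame is transformed independently for brevity."""
    return torch.stack([frame_augment(frame) for frame in clip])

clip = torch.rand(16, 3, 112, 112)       # dummy 16-frame clip with values in [0, 1]
print(augment_clip(clip).shape)          # torch.Size([16, 3, 112, 112])
```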
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Global Media Insight Home Page. Available online: https://www.globalmediainsight.com/blog/youtube-users-statistics/ (accessed on 7 June 2024).
- Youku Home Page. Available online: https://www.youku.com/ (accessed on 7 June 2024).
- TikTok Home Page. Available online: https://www.tiktok.com/ (accessed on 7 June 2024).
- Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, A.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016, arXiv:1609.08675. [Google Scholar]
- Fujimoto, Y.; Bashar, K. Automatic classification of multi-attributes from person images using GPT-4 Vision. In Proceedings of the 6th International Conference on Image, Video and Signal Processing, New York, NY, USA, 14–16 March 2024; pp. 207–212. [Google Scholar]
- Anushya, A. Video Tagging Using Deep Learning: A Survey. Int. J. Comput. Sci. Mob. Comput. 2020, 9, 49–55. [Google Scholar]
- Rani, P.; Kaur, J.; Kaswan, S. Automatic video classification: A review. EAI Endorsed Trans. Creat. Technol. 2020, 7, 163996. [Google Scholar] [CrossRef]
- Li, Y.; Wang, C.; Liu, J. A Systematic Review of Literature on User Behavior in Video Game Live Streaming. Int. J. Environ. Res. Public Health 2020, 17, 3328. [Google Scholar] [CrossRef]
- Zuo, Z.; Yang, L.; Liu, Y.; Chao, F.; Song, R.; Qu, Y. Histogram of fuzzy local spatio-temporal descriptors for video action recognition. IEEE Trans. Ind. Inform. 2019, 16, 4059–4067. [Google Scholar] [CrossRef]
- Islam, M.S.; Sultana, S.; Kumar Roy, U.; Al Mahmud, J. A review on video classification with methods, findings, performance, challenges, limitations and future work. J. Ilm. Tek. Elektro Komput. Dan Inform. 2020, 6, 47–57. [Google Scholar] [CrossRef]
- Ullah, H.A.; Letchmunan, S.; Zia, M.S.; Butt, U.M.; Hassan, F.H. Analysis of Deep Neural Networks for Human Activity Recognition in Videos—A Systematic Literature Review. IEEE Access 2021, 9, 126366–126387. [Google Scholar] [CrossRef]
- ur Rehman, A.; Belhaouari, S.B.; Kabir, M.A.; Khan, A. On the Use of Deep Learning for Video Classification. Appl. Sci. 2023, 13, 2007. [Google Scholar] [CrossRef]
- Zhang, J.; Yu, X.; Lei, X.; Wu, C. A novel deep LeNet-5 convolutional neural network model for image recognition. Comput. Sci. Inf. Syst. 2022, 19, 1463–1480. [Google Scholar] [CrossRef]
- Fu’Adah, Y.N.; Wijayanto, I.; Pratiwi, N.K.C.; Taliningsih, F.F.; Rizal, S.; Pramudito, M.A. Automated classification of Alzheimer’s disease based on MRI image processing using convolutional neural network (CNN) with AlexNet architecture. J. Phys. Conf. Ser. 2021, 1844, 012020. [Google Scholar] [CrossRef]
- Tammina, S. Transfer learning using vgg-16 with deep convolutional neural network for classifying images. Int. J. Sci. Res. Publ. (IJSRP) 2019, 9, 143–150. [Google Scholar] [CrossRef]
- Butt, U.M.; Letchmunan, S.; Hassan, F.H.; Zia, S.; Baqir, A. Detecting video surveillance using VGG19 convolutional neural networks. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 1–9. [Google Scholar] [CrossRef]
- Kieffer, B.; Babaie, M.; Kalra, S.; Tizhoosh, H.R. Convolutional neural networks for histopathology image classification: Training vs. using pre-trained networks. In Proceedings of the Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), Montreal, QC, Canada, 28 November 2017; pp. 1–6. [Google Scholar]
- Singla, A.; Yuan, L.; Ebrahimi, T. Food/non-food image classification and food categorization using pre-trained googlenet model. In Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, Amsterdam, The Netherlands, 16 October 2016; pp. 3–11. [Google Scholar]
- Kuttiyappan, D. Improving the Cyber Security over Banking Sector by Detecting the Malicious Attacks Using the Wrapper Stepwise Resnet Classifier. KSII Trans. Internet Inf. Syst. 2023, 17, 1657–1673. [Google Scholar]
- Hidayatuloh, A.; Nursalman, M.; Nugraha, E. Identification of tomato plant diseases by Leaf image using squeezenet model. In Proceedings of the International Conference on Information Technology Systems and Innovation (ICITSI), Bandung, Indonesia, 22 October 2018; pp. 199–204. [Google Scholar]
- Li, H. Image semantic segmentation method based on GAN network and ENet model. J. Eng. 2021, 10, 594–604. [Google Scholar] [CrossRef]
- Chen, Z.; Yang, J.; Chen, L.; Jiao, H. Garbage classification system based on improved ShuffleNet v2. Resour. Conserv. Recycl. 2022, 178, 106090. [Google Scholar] [CrossRef]
- Zhang, K.; Guo, Y.; Wang, X.; Yuan, J.; Ding, Q. Multiple feature reweight densenet for image classification. IEEE Access 2019, 7, 9872–9880. [Google Scholar] [CrossRef]
- Zhao, L.; He, Z.; Cao, W.; Zhao, D. Real-time moving object segmentation and classification from HEVC compressed surveillance video. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 1346–1357. [Google Scholar] [CrossRef]
- Sivasankaravel, V.S. Cost Effective Image Classification Using Distributions of Multiple Features. KSII Trans. Internet Inf. Syst. 2022, 16, 2154–2168. [Google Scholar]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Li, F.-F. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24 June 2014; pp. 1725–1732. [Google Scholar]
- Huang, D.; Zhang, L. Parallel Dense Merging Network with Dilated Convolutions for Semantic Segmentation of Sports Movement Scene. KSII Trans. Internet Inf. Syst. 2022, 16, 1–14. [Google Scholar]
- Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapés, A. Video Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943. [Google Scholar] [CrossRef]
- Wang, T.; Zhang, R.; Lu, Z.; Zheng, F.; Cheng, R.; Luo, P. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6847–6857. [Google Scholar]
- Gong, H.; Li, Q.; Li, C.; Dai, H.; He, Z.; Wang, W.; Li, H.; Han, F.; Tuniyazi, A.; Mu, T. Multi-scale Information Fusion for Hyperspectral Image Classification Based on Hybrid 2D-3D CNN. Remote Sens. 2021, 13, 2268. [Google Scholar] [CrossRef]
- Li, J. Parallel two-class 3D-CNN classifiers for video classification. In Proceedings of the 2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, China, 6–9 November 2017; pp. 7–11. [Google Scholar]
- Jing, L.; Parag, T.; Wu, Z.; Tian, Y.; Wang, H. Videossl: Semi-supervised learning for video classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, virtual event, 5–9 January 2021; pp. 1110–1119. [Google Scholar]
- Wu, Z.; Jiang, Y.G.; Wang, X.; Ye, H.; Xue, X. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 1 October 2016; pp. 791–800. [Google Scholar]
- Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 12 June 2015; pp. 4694–4702. [Google Scholar]
- Wu, Z.; Wang, X.; Jiang, Y.G.; Ye, H.; Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM international Conference on Multimedia, Brisbane, Australia, 13 October 2015; pp. 461–470. [Google Scholar]
- Tavakolian, M.; Hadid, A. Deep discriminative model for video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 382–398. [Google Scholar]
- Liu, M. Video Classification Technology Based on Deep Learning. In Proceedings of the 2020 International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Xi’an, China, 14 August 2020; pp. 154–157. [Google Scholar]
- Varadarajan, B.; Toderici, G.; Vijayanarasimhan, S.; Natsev, A. Efficient large scale video classification. arXiv 2015, arXiv:1505.06250. [Google Scholar]
- Mihanpour, A.; Rashti, M.J.; Alavi, S.E. Human action recognition in video using DB-LSTM and ResNet. In Proceedings of the 2020 6th International Conference on Web Research (ICWR), Tehran, Iran, 22–23 April 2020; pp. 133–138. [Google Scholar]
- Jiang, Y.G.; Wu, Z.; Tang, J.; Li, Z.; Xue, X.; Chang, S.F. Modeling multi-modal clues in a hybrid deep learning framework for video classification. IEEE Trans. Multimed. 2018, 20, 3137–3147. [Google Scholar] [CrossRef]
- Long, X.; Gan, C.; Melo, G.; Liu, X.; Li, Y.; Li, F.; Wen, S. Multi-modal keyless attention fusion for video classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 1–8. [Google Scholar]
- de Oliveira Lima, J.P.; Figueiredo, C.M.S. A temporal fusion approach for video classification with convolutional and LSTM neural networks applied to violence detection. Intel. Artif. 2021, 24, 40–50. [Google Scholar] [CrossRef]
- Abdullah, M.; Ahmad, M.; Han, D. Facial expression recognition in videos: An CNN-LSTM based model for video classification. In Proceedings of the 2020 International Conference on Electronics, Information, and Communication, Barcelona, Spain, 19–22 January 2020; pp. 1–3. [Google Scholar]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
- Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multi-modal Interaction, Tokyo, Japan, 31 October 2016; pp. 445–450. [Google Scholar]
- Li, G.; Fang, Q.; Zha, L.; Gao, X.; Zheng, N. HAM: Hybrid attention module in deep convolutional neural networks for image classification. Pattern Recognit. 2022, 129, 108785. [Google Scholar] [CrossRef]
- Mekruksavanich, S.; Jitpattanakul, A. Hybrid convolution neural network with channel attention mechanism for sensor-based human activity recognition. Sci. Rep. 2023, 13, 12067. [Google Scholar] [CrossRef] [PubMed]
- Ullah, W.; Hussain, T.; Ullah, F.U.M.; Lee, M.Y.; Baik, S.W. TransCNN: Hybrid CNN and transformer mechanism for surveillance anomaly detection. Eng. Appl. Artif. Intell. 2023, 123, 106173. [Google Scholar] [CrossRef]
- Xu, G.; Li, W.; Liu, J. A social emotion classification approach using multi-model fusion. Future Gener. Comput. Syst. 2020, 102, 347–356. [Google Scholar] [CrossRef]
- Jagannathan, P.; Rajkumar, S.; Frnda, J.; Divakarachari, P.B.; Subramani, P. Moving vehicle detection and classification using gaussian mixture model and ensemble deep learning technique. Wirel. Commun. Mob. Comput. 2021, 2021, 5590894. [Google Scholar] [CrossRef]
- Kyrkou, C.; Bouganis, C.S.; Theocharides, T.; Polycarpou, M.M. Embedded hardware-efficient real-time classification with cascade support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 99–112. [Google Scholar] [CrossRef]
- Pérez, I.; Figueroa, M. A Heterogeneous Hardware Accelerator for Image Classification in Embedded Systems. Sensors 2021, 21, 2637. [Google Scholar] [CrossRef] [PubMed]
- Ruiz-Rosero, J.; Ramirez-Gonzalez, G.; Khanna, R. Field Programmable Gate Array Applications—A Scientometric Review. Computation 2019, 7, 63. [Google Scholar] [CrossRef]
- Mao, M.; Va, H.; Hong, M. Video Classification of Cloth Simulations: Deep Learning and Position-Based Dynamics for Stiffness Prediction. Sensors 2024, 24, 549. [Google Scholar] [CrossRef] [PubMed]
- Takahashi, R.; Matsubara, T.; Uehara, K. Data Augmentation Using Random Image Cropping and Patching for Deep CNNs. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 2917–2931. [Google Scholar] [CrossRef]
- Kim, E.K.; Lee, H.; Kim, J.Y.; Kim, S. Data Augmentation Method by Applying Color Perturbation of Inverse PSNR and Geometric Transformations for Object Recognition Based on Deep Learning. Appl. Sci. 2020, 10, 3755. [Google Scholar] [CrossRef]
- Taylor, L.; Nitschke, G. Improving Deep Learning with Generic Data Augmentation. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Bengaluru, India, 18–21 November 2018; pp. 1542–1547. [Google Scholar]
- Sayed, M.; Brostow, G. Improved Handling of Motion Blur in Online Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1706–1716. [Google Scholar]
- Kim, E.; Kim, J.; Lee, H.; Kim, S. Adaptive Data Augmentation to Achieve Noise Robustness and Overcome Data Deficiency for Deep Learning. Appl. Sci. 2021, 11, 5586. [Google Scholar] [CrossRef]
- Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Arzani, M.M.; Yousefzadeh, R.; Van Gool, L. Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv 2017, arXiv:1711.08200. [Google Scholar]
- Ramesh, M.; Mahesh, K. A Performance Analysis of Pre-trained Neural Network and Design of CNN for Sports Video Classification. In Proceedings of the International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 28–30 June 2020; pp. 213–216. [Google Scholar]
- Aryal, S.; Porawagama, A.S.; Hasith, M.G.S.; Thoradeniya, S.C.; Kodagoda, N.; Suriyawansa, K. Using Pre-trained Models As Feature Extractor To Classify Video Styles Used In MOOC Videos. In Proceedings of the IEEE International Conference on Information and Automation for Sustainability (ICIAfS), Colombo, Sri Lanka, 21–22 December 2018; pp. 1–5. [Google Scholar]
- Wang, R.; Chen, D.; Wu, Z.; Chen, Y.; Dai, X.; Liu, M.; Jiang, Y.G.; Zhou, L.; Yuan, L. Bevt: Bert pre-training of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 14–18 June 2022; pp. 14733–14743. [Google Scholar]
- De Souza, C.R.; Gaidon, A.; Vig, E.; Lopez, A.M. System and Method for Video Classification Using a Hybrid Unsupervised and Supervised Multi-Layer Architecture. U.S. Patent 9,946,933, 17 April 2018. pp. 1–20. [Google Scholar]
- Jaouedi, N.; Boujnah, N.; Bouhlel, M.S. A new hybrid deep learning model for human action recognition. J. King Saud Univ.-Comput. Inf. Sci. 2020, 32, 447–453. [Google Scholar] [CrossRef]
- Kumaran, S.K.; Dogra, D.P.; Roy, P.P.; Mitra, A. Video trajectory classification and anomaly detection using hybrid CNN-VAE. arXiv 2018, arXiv:1812.07203. [Google Scholar]
- Ijjina, E.P.; Mohan, C.K. Hybrid deep neural network model for human action recognition. Appl. Soft Comput. 2016, 46, 936–952. [Google Scholar] [CrossRef]
- De Souza, C.R.; Gaidon, A.; Vig, E.; López, A.M. Sympathy for the details: Dense trajectories and hybrid classification architectures for action recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 697–716. [Google Scholar]
- Lei, J.; Li, G.; Zhang, J.; Guo, Q.; Tu, D. Continuous action segmentation and recognition using hybrid convolutional neural network-hidden Markov model model. IET Comput. Vis. 2016, 10, 537–544. [Google Scholar] [CrossRef]
- Dash, S.C.B.; Mishra, S.R.; Srujan Raju, K.; Narasimha Prasad, L.V. Human action recognition using a hybrid deep learning heuristic. Soft Comput. 2021, 25, 13079–13092. [Google Scholar] [CrossRef]
- Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-Offs in Video Classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
- Moskalenko, V.V.; Zaretsky, M.O.; Moskalenko, A.S.; Panych, A.O.; Lysyuk, V.V. A model and training method for context classification in cctv sewer inspection video frames. Radio Electron. Comput. Sci. Control. 2021, 3, 97–108. [Google Scholar] [CrossRef]
- Naik, K.J.; Soni, A. Video Classification Using 3D Convolutional Neural Network. In Advancements in Security and Privacy Initiatives for Multimedia Images; IGI Global: Hershey, PA, USA, 2021; pp. 1–18. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Solmaz, B.; Assari, S.M.; Shah, M. Classifying web videos using a global video descriptor. Mach. Vis. Appl. 2013, 24, 1473–1485. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
- Xu, H.; Das, A.; Saenko, K. Two-stream region convolutional 3D network for temporal activity detection. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2319–2332. [Google Scholar] [CrossRef] [PubMed]
- AVA Home Page. Available online: https://research.google.com/ava/ (accessed on 7 June 2024).
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar]
- Srivastava, S.; Sharma, G. Omnivec: Learning robust representations with cross modal sharing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 1236–1248. [Google Scholar]
- Wu, W.; Wang, X.; Luo, H.; Wang, J.; Yang, Y.; Ouyang, W. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6620–6630. [Google Scholar]
- Li, X.; Wang, L. ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video. arXiv 2023, arXiv:2310.01324. [Google Scholar]
- Wu, W.; Sun, Z.; Ouyang, W. Revisiting classifier: Transferring vision-language models for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 June 2023; pp. 2847–2855. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 6299–6308. [Google Scholar]
- Huang, G.; Bors, A.G. Busy-quiet video disentangling for video classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1341–1350. [Google Scholar]
- Zhang, J.; Shen, F.; Xu, X.; Shen, H.T. Cooperative cross-stream network for discriminative action representation. arXiv 2019, arXiv:1908.10136. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
- Hong, J.; Cho, B.; Hong, Y.W.; Byun, H. Contextual action cues from camera sensor for multi-stream action recognition. Sensors 2019, 19, 1382. [Google Scholar] [CrossRef]
- Zhao, Z.; Huang, B.; Xing, S.; Wu, G.; Qiao, Y.; Wang, L. Asymmetric Masked Distillation for Pre-Training Small Foundation Models. arXiv 2023, arXiv:2311.03149. [Google Scholar]
- Sharir, G.; Noy, A.; Zelnik-Manor, L. An image is worth 16x16 words, what is a video worth? arXiv 2021, arXiv:2103.13915. [Google Scholar]
- Zhu, L.; Tran, D.; Sevilla-Lara, L.; Yang, Y.; Feiszli, M.; Wang, H. FASTER Recurrent Networks for Efficient Video Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13098–13105. [Google Scholar]
- Qiu, Z.; Yao, T.; Ngo, C.W.; Tian, X.; Mei, T. Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12056–12065. [Google Scholar]
- Zhang, Y.; Li, X.; Liu, C.; Shuai, B.; Zhu, Y.; Brattoli, B.; Chen, H.; Marsic, I.; Tighe, J. VidTr: Video Transformer without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13577–13587. [Google Scholar]
- Shou, Z.; Lin, X.; Kalantidis, Y.; Sevilla-Lara, L.; Rohrbach, M.; Chang, S.F.; Yan, Z. Dmc-net: Generating discriminative motion cues for fast compressed video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1268–1277. [Google Scholar]
- Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-nets: Double attention networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1–10. [Google Scholar]
- Sun, S.; Kuang, Z.; Sheng, L.; Ouyang, W.; Zhang, W. Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1390–1399. [Google Scholar]
- Crasto, N.; Weinzaepfel, P.; Alahari, K.; Schmid, C. Motion-augmented rgb stream for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7882–7891. [Google Scholar]
- Liu, M.; Chen, X.; Zhang, Y.; Li, Y.; Rehg, J.M. Attention distillation for learning video representations. arXiv 2019, arXiv:1904.03249. [Google Scholar]
- Fan, L.; Huang, W.; Gan, C.; Ermon, S.; Gong, B.; Huang, J. End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6016–6025. [Google Scholar]
- Huang, G.; Bors, A.G. Learning spatio-temporal representations with temporal squeeze pooling. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2103–2107. [Google Scholar]
- Christoph, R.; Pinz, F.A. Spatiotemporal residual networks for video action recognition. Adv. Neural Inf. Process. Syst. 2016, 2, 3468–3476. [Google Scholar]
- Liu, Q.; Che, X.; Bie, M. R-STAN: Residual spatial-temporal attention network for action recognition. IEEE Access 2019, 7, 82246–82255. [Google Scholar] [CrossRef]
- Wang, L.; Li, W.; Li, W.; Van Gool, L. Appearance-And-Relation Networks for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018; pp. 1430–1439. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36. [Google Scholar]
- Ma, C.Y.; Chen, M.H.; Kira, Z.; AlRegib, G. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Process. Image Commun. 2019, 71, 76–87. [Google Scholar] [CrossRef]
- Ranasinghe, K.; Naseer, M.; Khan, S.; Khan, F.S.; Ryoo, M.S. Self-supervised video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–20 June 2022; pp. 2874–2884. [Google Scholar]
- Tan, H.; Lei, J.; Wolf, T.; Bansal, M. Vimpac: Video pre-training via masked token prediction and contrastive learning. arXiv 2021, arXiv:2106.11250. [Google Scholar]
- Zhao, J.; Snoek, C.G. Dance with flow: Two-in-one stream action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9935–9944. [Google Scholar]
- Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef]
- Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y. Towards good practices for very deep two-stream convnets. arXiv 2015, arXiv:1507.02159. [Google Scholar]
- Shalmani, S.M.; Chiang, F.; Zheng, R. Efficient action recognition using confidence distillation. In Proceedings of the 26th International Conference on Pattern Recognition, Montreal, QC, Canada, 21–25 August 2022; pp. 3362–3369. [Google Scholar]
- Peng, X.; Schmid, C. Multi-region two-stream R-CNN for action detection. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 744–759. [Google Scholar]
- Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A.; Gould, S. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3034–3042. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
- Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726. [Google Scholar]
- Nguyen, H.P.; Ribeiro, B. Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer. Sci. Rep. 2023, 13, 14624. [Google Scholar] [CrossRef] [PubMed]
- Tran, D.; Ray, J.; Shou, Z.; Chang, S.F.; Paluri, M. Convnet architecture search for spatiotemporal feature learning. arXiv 2017, arXiv:1708.05038. [Google Scholar]
- Ng, J.Y.H.; Choi, J.; Neumann, J.; Davis, L.S. Actionflownet: Learning motion representation for action recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1616–1624. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Parmar, P.; Morris, B. HalluciNet-ing spatiotemporal representations using a 2D-CNN. Signals 2021, 2, 604–618. [Google Scholar] [CrossRef]
- Pan, T.; Song, Y.; Yang, T.; Jiang, W.; Liu, W. Videomoco: Contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11205–11214. [Google Scholar]
- Mazari, A.; Sahbi, H. MLGCN: Multi-Laplacian graph convolutional networks for human action recognition. In Proceedings of the British Machine Vision Conference, Cardiff, UK, 9–12 September 2019; pp. 1–27. [Google Scholar]
- Zhu, Y.; Long, Y.; Guan, Y.; Newsam, S.; Shao, L. Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9436–9445. [Google Scholar]
- Choutas, V.; Weinzaepfel, P.; Revaud, J.; Schmid, C. Potion: Pose motion representation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7024–7033. [Google Scholar]
Abbreviations | Description |
---|---|
3DCLSTM | 3D Convolutional Long Short-Term Memory |
Adam | Adaptive Moment Estimation |
AMD | Asymmetric Masked Distillation |
ARTNet | Appearance-and-Relation Network |
BIKE | BIdirectional Crossmodal Knowledge Exploration |
C3D | 3D Convolutional Network |
CCS | Cooperative Cross-Stream |
CD-UAR | Cross-Dataset UAR |
CNN | Convolutional Neural Network |
CNNs | Convolutional Neural Networks |
CRNNs | Convolutional Recurrent Neural Networks |
DB-LSTM | Deep Bidirectional LSTM |
DMC-Net | Discriminative Motion Cue Network |
FASTER32 | Feature Aggregation for Spatio-Temporal Redundancy (32 Frames) |
FPGA | Field-Programmable Gate Array |
HalluciNet | Hallucination Network |
HAM | Hybrid Attention Module |
HFLSTD | Histogram of fuzzy local spatio-temporal descriptors |
I3D | Inflated 3D ConvNet |
iFDT | Improved fuzzy dense trajectories |
LGD-3D Flow | Local Global Diffusion |
LSTM | Long Short-Term Memory |
LTC | Long-Term Temporal Convolutions |
MARS | Motion-Augmented RGB Stream |
MiCRObE | Max Calibration mixture of Experts |
MLGCN | Multi-Laplacian Graph Convolutional Network |
MR Two-Stream R-CNN | Multi-Region Two-Stream R-CNN |
MV-CNN | Motion Vector CNN |
OmniVec | Omnidirectional Vector |
PoTion | Pose moTion |
Prob-Distill | Probabilistic Distillation |
ReLU | Rectified Linear Unit |
Res3D | Residual Three Dimensional |
RGB | Red–Green–Blue |
RGB-I3D | Red–Green–Blue Inflated 3D ConvNet |
RNNs | Recurrent Neural Networks |
R-STAN | Residual Spatial–Temporal Attention Network |
SGD | Stochastic Gradient Descent |
STAM | Space–Time Attention Model |
ST-ResNet | Spatio-temporal Residual Network |
SVT | Self-supervised Video Transformer |
TDD | Trajectory-pooled Deep-convolutional Descriptor |
TS-LSTM | Temporal Segment LSTM |
TSN | Temporal Segment Network |
UAR | Unseen Action Recognition |
VidTr-L | Video Transformer—Large |
VIMPAC | Video pre-training via Masked token Prediction And Contrastive learning |
ZeroI2V | Zero-cost Adaptation Paradigm |
Ref. | Year | Features | Drawbacks |
---|---|---|---|
Anushya [6] | 2020 | Video classification, tagging, and clustering. | Limited scope and lacks detailed information. |
Rani et al. [7] | 2020 | Classify video content using text, audio, and visual features. | Did not include an analysis of the latest state-of-the-art approaches. |
Li et al. [8] | 2020 | Real-time sports video classification. | Focuses specifically on real sports video classification. |
Zuo et al. [9] | 2020 | Fuzzy local spatio-temporal descriptors for video action recognition. | Uncertainty in pixel voting due to varying numbers of bins. |
Islam et al. [10] | 2021 | Machine learning techniques for classifying video. | Reviews are less focused on deep learning methods. |
Ullah et al. [11] | 2021 | Recognizing human activities with deep learning. | Primarily emphasizes human activity recognition. |
Rehman et al. [12] | 2022 | Detailed review of deep learning strategies for classifying videos. | Places less emphasis on pre-training and foundational model techniques in deep learning for video classification. |
This study | 2024 | Comprehensive techniques for video classification, dataset benchmarking, and deep learning models. | - |
Year | Model | Key Features | Contributions |
---|---|---|---|
1998 | LeNet-5 | Five layers, convolutional and pooling layers. | Pioneered CNNs for digit recognition. |
2012 | AlexNet | Eight layers, ReLU activation, dropout. | Won ImageNet 2012, popularized deep learning. |
2014 | VGGNet | 16–19 layers, small (3 × 3) convolution filters. | Demonstrated the importance of depth. |
2014 | GoogLeNet | Inception modules, 22 layers. | Improved computational efficiency. |
2015 | ResNet | Residual blocks, up to 152 layers. | Enabled training of very deep networks. |
2015 | YOLO | Real-time object detection. | Unified detection and classification, efficient for video analysis. |
2016 | SqueezeNet | Fire modules, 50× fewer parameters. | Achieved AlexNet-level accuracy with fewer parameters. |
2016 | ENet | Lightweight encoder–decoder with early downsampling. | Enabled real-time semantic segmentation with few parameters. |
2017 | ShuffleNet | Point-wise group convolution, channel shuffle. | Efficient computation for mobile devices |
2017 | DenseNet | Dense connections between layers. | Promoted feature reuse and reduced parameters. |
2017 | MobileNet | Depth-wise separable convolutions. | Optimized for mobile and embedded vision applications. |
2019 | BowNet | Encoder–decoder structure. | Real-time tracking of tongue contours in ultrasound data. |
Approach | Features | Model | Evaluations | Problems |
---|---|---|---|---|
Zhang et al. [13] | Modifies the activation function with a logarithmic Rectified Linear Unit (L_ReLU). | LeNet-5 | ReLU | High hardware requirements, large training sample size, extended training time, slow convergence speed, and low accuracy. |
Fu'adah et al. [14] | Automated classification system for MRI images using AlexNet. | AlexNet | Adam, binary cross-entropy | Automated detection of Alzheimer’s disease from MRI images. |
Tammina [15] | Classification, regression, and clustering. | VGG-16 | Binary cross-entropy | Image classification problem with the restriction of having a small number of training samples per category. |
Butt et al. [16] | Street crime snatching and theft detection in video mining. | VGG-19 | ReLU, softmax | The meteoric growth of the Internet has made mining and extracting valuable patterns from a large dataset challenging. |
Kieffer et al. [17] | Classification. | Inception | Linear, SVM | The task involves retrieving and classifying histopathological images to analyze diagnostic pathology. |
Singla et al. [18] | Food image classification. | GoogleNet | Binary classification | Food image classification and recognition are crucial steps for dietary assessment. |
Kuttiyappan [19] | Hierarchical network feature extraction. | ResNet | Adam | Improving the cybersecurity of the banking sector by detecting malicious attacks using the wrapper stepwise ResNet classifier. |
Hidayatuloh et al. [20] | Detection and diagnosis of plant diseases. | SqueezeNet | Adam | Identify the types of diseases on the leaves of tomato plants and their healthy leaves. |
Li [21] | Image semantic segmentation. | ENet | MIoU | Improve the network model of the generative adversarial network. |
Chen et al. [22] | Garbage classification. | ShuffleNet | Cross entropy, SGD | Improve the consistency, stability, and sanitary conditions for garbage classification. |
Zhang et al. [23] | Multiple features reweight DenseNet. | DenseNet | SGD | Adaptively recalibrating the channel-wise feature and explicitly modeling the interdependence between the features of different convolutional layers. |
Approach | Architecture | Features | Operational Mechanism |
---|---|---|---|
Li [31] | 3D-CNN | Multi-class, temporally downsampled, increment of the new class. | Utilizing each 3D-CNN as a binary classifier for a distinct video class streamlines training and decreases computational overhead. |
Jing et al. [32] | Semi-supervised | Supervisory signals extracted from unlabeled data, 2D images for semi-supervised learning of 3D video clips. | Three loss functions are employed to optimize the 3D network: video cross-entropy loss on labeled data, pseudo-cross-entropy loss on unlabeled data’s pseudo-labels, and soft cross-entropy loss on both labeled and unlabeled data to capture appearance features. |
Wu et al. [33] | Multi-Stream, ConvNets | Multi-stream, multi-class. | Effectively recognizes video semantics with precise and discriminative appearance characteristics; the motion stream trains a ConvNet model that operates on stacked optical flows. |
Yue-Hei Ng et al. [34] | Convolutional temporal | CNN feature computation, feature aggregation. | Pooling feature methods that were max-pooling local information through time and LSTM, whose hidden state evolves with each sequential frame. |
Wu et al. [35] | Short-term motion | Short-term spatial–motion patterns, long-term temporal clues. | Extracts spatial and motion features with two CNNs trained on static frames and stacked optical flow. |
Tavakolian et al. [36] | Heterogeneous Deep Discriminative Model (HDDM) | Unsupervised pre-training, redundancy-adjacent frames, spatio-temporal variation patterns. | HDDM weights are initialized by an unsupervised layer-wise pre-training stage using Gaussian Restricted Boltzmann Machines (GRBM) |
Liu [37] | Simple Recurrent Units method (SRU) | Feature extraction, feature fusion, and similarity measurement. | The SRU network can obtain the overall characterization of video features to a certain extent through average pooling. |
Varadarajan et al. [38] | Max Calibration mixtuRe of Experts (MiCRObE) | Hand-crafted. | MiCRObE can be used as a frame-level classification that does not require human-selected and frame-level ground truth. |
Mihanpour et al. [39] | Deep bidirectional LSTM (DB-LSTM) | Frame extraction, forward and backward passes of DB-LSTM. | The DB-LSTM recurrent network is used in forward and backward transitions, and the final classification is performed. |
Jiang et al. [40] | Hybrid deep learning | Multi-model clues, static spatial motion patterns. | Integrating a comprehensive set of multi-modal clues for video categorization by employing three independent CNN models: one operating on static frames, another on stacked optical flow images, and the third on audio spectrograms to compute spatial, motion, and audio features. |
Datasets | # of Videos | Resolutions | # of Classes | Year |
---|---|---|---|---|
KTH | 2,391 | 160 × 120 | 6 | 2004 |
Weizmann | 81 | 180 × 144 | 9 | 2005 |
Kodak | 1,358 | 768 × 512 | 25 | 2007 |
Hollywood | 430 | 400 × 300, 300 × 200 | 8 | 2008 |
YouTube Celebrities Face | 1,910 | - | 47 | 2008 |
Hollywood2 | 1,787 | 400 × 300, 300 × 200 | 12 | 2009 |
UCF11 | 1,600 | 720 × 480 | 11 | 2009 |
UCF Sports | 150 | 720 × 480 | 10 | 2009 |
MCG-WEBV | 234,414 | - | 15 | 2009 |
Olympic Sports | 800 | 90 × 120 | 16 | 2010 |
HMDB51 | 6,766 | 320 × 240 | 51 | 2011 |
CCV | 9,317 | - | 20 | 2011 |
JHMDB | 960 | - | 21 | 2011 |
UCF-101 | 13,320 | 320 × 240 | 101 | 2012 |
THUMOS 2014 | 18,394 | - | 101 | 2014 |
MED-2014 (Dev. set) | 31,000 | - | 20 | 2014 |
Sports-1M | 1,133,158 | 320 × 240 | 487 | 2014 |
MPII Human Pose | 25 K | - | 410 | 2014 |
ActivityNet | 27,901 | 1280 × 720 | 203 | 2015 |
EventNet | 95,321 | - | 500 | 2015 |
FCVID | 91,223 | - | 239 | 2015 |
Kinetics | 650,000 | - | 400, 600, 700 | 2017 |
Something-something V1 | 110,000 | 100 × (~) | 174 | 2017 |
YouTube-8M | 6.1 M | - | 3862 | 2018 |
Moments in Time | 802,264 | 340 × 256 | 339 | 2018 |
EPIC-KITCHENS | 396 K | - | 149 | 2018 |
Charades-Ego | 68,536 | - | 157 | 2019 |
AVA-Kinetics | 230 K | - | 80 | 2020 |
Something-something V2 | 220,847 | - | 174 | 2021 |
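Because the datasets above differ widely in clip length and resolution, training pipelines usually sample a fixed number of frames per video before batching. The small helper below illustrates one common uniform-sampling scheme; the 16-frame clip length is an arbitrary assumption.

```python
import numpy as np

def sample_frame_indices(num_frames_in_video, clip_len=16):
    """Uniformly sample `clip_len` frame indices from a variable-length video,
    so clips fed to the network always have a fixed temporal size."""
    if num_frames_in_video >= clip_len:
        # Evenly spaced indices spanning the whole video.
        return np.linspace(0, num_frames_in_video - 1, clip_len).astype(int)
    # Short video: repeat the last frame to pad up to clip_len.
    idx = np.arange(num_frames_in_video)
    pad = np.full(clip_len - num_frames_in_video, num_frames_in_video - 1)
    return np.concatenate([idx, pad])

print(sample_frame_indices(300, clip_len=16))   # long video: 16 spread-out indices
print(sample_frame_indices(10, clip_len=16))    # short video: padded with the last frame
```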
Attributes | Single Deep Learning | Hybrid Deep Learning |
---|---|---|
Applications | Limited application | Diverse application |
Classifier diversity | Limited to Softmax | Softmax or ML-based |
Feature extraction | Limited scope | Larger scope |
Hardware resource | Uses fewer resources | Uses more resources |
Performance evaluation | Low performance | Superior performance |
Program complexity | Low complexity | High complexity |
Transfer learning | Limited options for transfer learning | More options for transfer learning |
Approach | Features | Model Architectures | Datasets Used | Problem | Results/Findings | Year |
---|---|---|---|---|---|---|
Wu et al. [35] | Spatial, short-term motion. | CNN + LSTM | UCF-101, CCV | Content semantics | UCF-101: 91.3%, CCV: 83.5%. | 2015 |
Jiang et al. [40] | Corresponding features, motion features, and multi-modal features. | CNN + LSTM | UCF-101, CCV | Multi-modal clues | UCF-101: 93.1%, CCV: 84.5%. | 2018 |
De Souza et al. [64] | Spatio-temporal features. | FV-SVM | UCF-101, HMDB-51 | Content of video | UCF-101: 90.6%, HMDB-51: 67.8%. | 2018 |
Jaouedi et al. [65] | Spatial features: reduce the size of the data processed; various object features. | GMM + KF + GRNN | UCF Sport, UCF-101, KTH | Facilitate clues | KTH: 96.30% | 2020 |
Zuo et al. [9] | Fuzzy local spatio-temporal descriptors | HFLSTD + iFDT | UCF-50, UCF-101 | Uncertainty in pixel voting due to varying numbers of bins | UCF-50: 95.4%, UCF-101: 97.3% | 2020 |
Wu et al. [33] | Multi-modal | CNN + LSTM | UCF-101, CCV | Multi-stream | UCF-101: 92.2%, CCV: 84.9% | 2016 |
Kumaran et al. [66] | Latent features, spatio-temporal features. | CNN-VAE | T15, QMUL, 4WAY | Classify the time series | T15: 99.0%, QMUL: 97.3%, 4WAY: 99.5% | 2018 |
Ijjina et al. [67] | Action bank features. | Hybrid deep neural network + CNN | UCF50 | - | UCF50: 99.68% | 2015 |
De Souza et al. [68] | Hand-crafted, spatio-temporal features. | iDT + STA + DAFS + DN | UCF-101, HMDB-51, Hollywood2, High-Five, Olympics | Large video data | UCF-101: 90.6%, HMDB-51: 67.8%, Hollywood2: 69.1%, High-Five: 71.0%, Olympics: 92.8%. | 2016 |
Lei et al. [69] | High-level features, robust action features. | CNN-HMM | Weizmann, KTH | Complex temporal dynamics | Weizmann: 89.2%, KTH: 93.97% | 2016 |
Dash et al. [70] | Sophisticated hand-crafted motion features. | SIFT-CNN | UCF, KTH | Action recognition | UCF: 89.5%, KTH: 90%. | 2021 |
Evaluation Metric | Year of Publication | Reference |
---|---|---|
Accuracy | 2018 | Xie et al. [71] |
Precision | 2014 | Karpathy et al. [26] |
Recall | 2016 | Abu-El-Haija et al. [4] |
F1 score | 2021 | de Oliveira Lima et al. [42] |
Micro-F1 score | 2021 | Moskalenko et al. [72] |
K-Fold cross-validation | 2021 | Naik et al. [73] |
Top-k | 2015 | Varadarajan et al. [38] |
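To illustrate how the tabulated metrics are commonly computed in practice, the following generic sketch (not the evaluation code of any cited work) uses scikit-learn for accuracy and macro-averaged precision/recall/F1, plus a small NumPy helper for top-k accuracy; the dummy scores and labels are invented for demonstration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]          # indices of the k best classes
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Dummy predictions for a 4-class problem (scores: N x C, labels: N).
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
labels = np.array([1, 0, 3])
preds = scores.argmax(axis=1)

acc = accuracy_score(labels, preds)
prec, rec, f1, _ = precision_recall_fscore_support(labels, preds,
                                                   average="macro", zero_division=0)
print(f"top-1={acc:.2f}  top-2={top_k_accuracy(scores, labels, k=2):.2f}  "
      f"macro-F1={f1:.2f}")
```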
Ref. | Dataset | Characteristics | Challenges | Suitability for Tasks |
---|---|---|---|---|
[74] | UCF-101 | Consists of 101 action categories. | Limited diversity in activities and scenarios. | Basic action recognition. |
[75] | HMDB-51 | Contains videos from a diverse set of activities. | The limited number of samples per class. | Basic action recognition. |
[76] | Kinetics-400 | Large-scale dataset with 400 action classes. | Requires significant computation resources. | Complex action recognition. |
[77] | ActivityNet | Contains untrimmed videos annotated with activities. | Temporal localization and annotation. | Activity detection and temporal action localization. |
[78] | AVA | Focuses on human–object interactions in video. | Requires fine-grained action annotations. | Human–object interaction recognition. |
[79] | Something-something v. 2 | Addresses fine-grained action recognition with interventions involving everyday objects. | Limited in vocabulary and scale. | Fine-grained action recognition. |
Ref. | Method | Accuracy (%) | Year of Publication |
---|---|---|---|
Srivastava et al. [80] | OmniVec | 99.6 | 2023 |
Wu et al. [81] | BIKE | 98.9 | 2022 |
Li et al. [82] | ZeroI2V ViT-L/14 | 98.6 | 2023 |
Wu et al. [83] | Text4Vis | 98.2 | 2022 |
Carreira et al. [84] | Two-Stream I3D | 98.0 | 2017 |
Huang et al. [85] | BQN | 97.6 | 2021 |
Zhang et al. [86] | CCS + TSN | 97.4 | 2019 |
Tran et al. [87] | R[2 + 1]D-TwoStream | 97.3 | 2017 |
Hong et al. [88] | Multi-stream I3D | 97.2 | 2019 |
Zhao et al. [89] | AMD | 97.1 | 2023 |
Sharir et al. [90] | STAM-32 | 97.0 | 2021 |
Zhu et al. [91] | FASTER32 | 96.9 | 2019 |
Qiu et al. [92] | LGD-3D Flow | 96.8 | 2019 |
Zhang et al. [93] | VidTr-L | 96.7 | 2021 |
Shou et al. [94] | I3D RGB + DMC-Net | 96.5 | 2019 |
Chen et al. [95] | A2-Net | 96.4 | 2018 |
Sun et al. [96] | Optical-Flow-Guided Feature | 96.0 | 2017 |
Crasto et al. [97] | MARS + RGB + Flow | 95.8 | 2019 |
Liu et al. [98] | Prob-Distill | 95.7 | 2019 |
Carreira et al. [84] | RGB-I3D | 95.6 | 2017 |
Tran et al. [87] | R[2 + 1]D-Flow | 95.5 | 2017 |
Fan et al. [99] | TVNet + IDT | 95.4 | 2018 |
Huang et al. [100] | TesNet | 95.2 | 2020 |
Carreira et al. [84] | RGB-I3D | 95.1 | 2017 |
Tran et al. [87] | R[2 + 1]D-TwoStream | 95.0 | 2017 |
Christoph et al. [101] | ST-ResNet + IDT | 94.6 | 2016 |
Liu et al. [102] | R-STAN-101 | 94.5 | 2019 |
Wang et al. [103] | ARTNet with TSN | 94.3 | 2018 |
Wang et al. [104] | Temporal Segment Networks | 94.2 | 2016 |
Ma et al. [105] | TS-LSTM | 94.1 | 2017 |
Ranasinghe et al. [106] | SVT + ViT-B | 93.7 | 2021 |
Tran et al. [87] | R[2 + 1]D-RGB | 93.6 | 2017 |
Carreira et al. [84] | Two-stream I3D | 93.4 | 2017 |
Tran et al. [87] | R[2 + 1]D-Flow | 93.3 | 2017 |
Tan et al. [107] | VIMPAC | 92.7 | 2021 |
Feichtenhofer et al. [44] | S:VGG-16, T:VGG-16 | 92.5 | 2016 |
Shou et al. [94] | DMC-Net | 92.3 | 2019 |
Zhao et al. [108] | Two-in-one two stream | 92.0 | 2019 |
Varol et al. [109] | LTC | 91.7 | 2016 |
Wang et al. [110] | TDD + IDT | 91.5 | 2015 |
Wang et al. [111] | Very deep two-stream ConvNet | 91.4 | 2015 |
Shalmani et al. [112] | 3D ResNeXt-101 + Confidence Distillation | 91.2 | 2021 |
Peng et al. [113] | MR Two-Stream R-CNN | 91.1 | 2016 |
Ranasinghe et al. [106] | SVT | 90.8 | 2021 |
Bilen et al. [114] | Dynamic Image Networks + IDT | 89.1 | 2016 |
Yue-Hei Ng et al. [34] | Two-stream + LSTM | 88.6 | 2015 |
Simonyan et al. [115] | Two-Stream | 88.0 | 2014 |
Zhang et al. [116] | MV-CNN | 86.4 | 2016 |
Nguyen et al. [117] | Dynamics 2 for DenseNet-201 Transformer | 86.1 | 2023 |
Tran et al. [118] | Res3D | 85.8 | 2017 |
Yue-Hei Ng et al. [119] | ActionFlowNet | 83.9 | 2016 |
Tran et al. [120] | C3D | 82.3 | 2014 |
Parmar et al. [121] | HalluciNet | 79.8 | 2019 |
Pan et al. [122] | R[2 + 1]D | 78.7 | 2021 |
Pan et al. [122] | 3D-ResNet18 | 74.1 | 2021 |
Karpathy et al. [26] | Slow Fusion + Fine-Tune Top 3 layers | 65.4 | 2014 |
Mazari et al. [123] | MLGCN | 63.2 | 2019 |
Zhu et al. [124] | CD-UAR | 42.5 | 2018 |
Choutas et al. [125] | I3D + PoTion | 29.3 | 2018 |