Advances in Human Action Recognition Using Deep Learning

A special issue of Journal of Imaging (ISSN 2313-433X).

Deadline for manuscript submissions: closed (30 June 2022) | Viewed by 16391

Special Issue Editors


Dr. Anoop Cherian
Guest Editor
1. Australian Centre for Robotic Vision, Australian National University, Canberra, ACT 2600, Australia
2. Mitsubishi Electric Research Labs, Cambridge, MA 02139, USA
Interests: multimodal video understanding; activity recognition

Dr. Basura Fernando
Guest Editor
Institute for High Performance Computing, Agency for Science, Technology, and Research (A*STAR), 1 Fusionopolis Way #08-10 Connexis North, Singapore 138632, Singapore
Interests: action recognition; transfer learning; video forecasting

Special Issue Information

Dear Colleagues,

Human action recognition in video sequences is a problem of immense applicability in a variety of real-world situations, including but not limited to (i) video surveillance, such as monitoring for fatalities in elderly homes or ensuring safety in public places; (ii) autonomous driving, where a driving agent needs to predict the next action of a pedestrian standing at an intersection; (iii) learning from instructional videos, where one needs to quickly find the next step in cooking a dish; (iv) medical diagnosis, such as the early diagnosis of autism-spectrum disorders in children; and (v) even leisure applications, such as summarizing a long movie. All of these tasks need human action recognition as a fundamental ingredient in their solutions.

Building on developments in deep neural networks, the problem of human action recognition has seen significant research strides in recent years. Starting from the success of two-stream neural architectures, deep-learning-based advances in action recognition have moved into the design of sophisticated neural networks based on contrastive learning, transformers, and 3D-CNNs, and are progressing rapidly towards matching human performance.
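
To make that starting point concrete, below is a minimal sketch of the classic two-stream idea in PyTorch: one small 2D-CNN over an RGB frame and another over a stack of optical-flow frames, fused by averaging class scores. All module names, layer sizes, and hyperparameters are our own illustrative choices, not those of any particular published architecture.

```python
# A minimal two-stream sketch (illustrative names and sizes only).
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """A tiny 2D-CNN backbone shared by both streams."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStreamNet(nn.Module):
    """Late fusion of an RGB stream and a stacked optical-flow stream."""
    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        self.rgb_stream = StreamCNN(3, num_classes)
        # Each flow frame contributes an (x, y) displacement channel pair.
        self.flow_stream = StreamCNN(2 * flow_stack, num_classes)

    def forward(self, rgb, flow):
        # Average the per-stream class scores (simple late fusion).
        return 0.5 * (self.rgb_stream(rgb) + self.flow_stream(flow))

model = TwoStreamNet(num_classes=400)
rgb = torch.randn(4, 3, 112, 112)     # batch of RGB frames
flow = torch.randn(4, 20, 112, 112)   # 10 stacked flow frames, 2 channels each
scores = model(rgb, flow)             # shape: (4, 400)
```

Modern variants replace these toy backbones with 3D-CNNs or transformers, but the late-fusion structure above is often retained in spirit.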

Nonetheless, the performance of action recognition systems still falls far short of what object recognizers achieve today. This gap is perhaps due to the unique challenges the video modality poses from computational and algorithmic perspectives. For example, the temporal evolution of objects in videos adds a challenging dimension to recognition, potentially requiring larger training sets to span the space of actions while also demanding neural models of larger capacity. The problem is further exacerbated by the fact that real-world video sequences often contain camera motion, varied camera poses, object/actor occlusions, and video-specific non-stationary noise such as motion blur, among many others. Thus, a general-purpose deep-learning-based action recognition system needs to address these varied challenges to succeed in the real world.

To this end, this Special Issue offers a venue for high-quality research centered on the aforementioned issues in human action recognition using deep learning techniques. We look forward to submissions that address any aspect of action recognition, and we aim to publish research that is unique in its approach, generalizes across varied data conditions, and demonstrates strong empirical performance.

Dr. Anoop Cherian
Dr. Basura Fernando
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Journal of Imaging is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning architectures for human action recognition
  • multimodal video representation learning for action recognition, including skeletons, depth, etc.
  • self-supervised/contrastive approaches to action recognition
  • generative/adversarial/variational approaches to action recognition
  • graph neural networks/transformers and variants for action recognition
  • geometric approaches to action recognition
  • novel tasks/applications and datasets based on action recognition
  • few-shot/weakly-supervised methods for action recognition
  • efficient deep learning for action recognition

Published Papers (4 papers)

Research

12 pages, 4822 KiB  
Article
Multi-Camera Multi-Person Tracking and Re-Identification in an Operating Room
by Haowen Hu, Ryo Hachiuma, Hideo Saito, Yoshifumi Takatsume and Hiroki Kajita
J. Imaging 2022, 8(8), 219; https://doi.org/10.3390/jimaging8080219 - 17 Aug 2022
Cited by 3 | Viewed by 3646
Abstract
Multi-camera multi-person (MCMP) tracking and re-identification (ReID) are essential tasks in safety, pedestrian analysis, and related applications; however, most research focuses on outdoor scenarios, while indoor settings such as a crowded room with obstacles are much more complicated to handle owing to occlusions and misidentification. Moreover, it is challenging to complete the two tasks in one framework. We present a trajectory-based method integrating the tracking and ReID tasks. First, the poses of all surgical members captured by each camera are detected frame-by-frame; then, the detected poses are used to track the trajectories of all members for each camera; finally, the trajectories from different cameras are clustered to re-identify the members in the operating room across all cameras. Compared to other MCMP tracking and ReID methods, the proposed one mainly exploits trajectories, taking texture features, which are less distinguishable in the operating room scenario, as auxiliary cues. We also integrate temporal information during ReID, which is more reliable than the state-of-the-art framework where ReID is conducted frame-by-frame. In addition, our framework requires no training before deployment in new scenarios. We also created an annotated MCMP dataset with actual operating room videos. Our experiments demonstrate the effectiveness of the proposed trajectory-based ReID algorithm. The proposed framework achieves 85.44% accuracy in the ReID task, outperforming the state-of-the-art framework on our operating room dataset.
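
The cross-camera clustering stage of such a pipeline could, for instance, look like the sketch below, which groups per-camera trajectories by agglomerative clustering on a pairwise trajectory distance. This is a minimal illustration under our own assumptions (a shared ground-plane coordinate system and synchronized cameras); the function and variable names are hypothetical and not the authors' code.

```python
# Hypothetical sketch: cross-camera ReID by trajectory clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def trajectory_distance(t1, t2):
    """Mean distance between two trajectories (arrays of shape (T, 2))
    over their overlapping frames, assuming calibrated cameras that
    map detections into a shared floor-plane coordinate system."""
    n = min(len(t1), len(t2))
    return float(np.linalg.norm(t1[:n] - t2[:n], axis=1).mean())

def cluster_trajectories(trajectories, max_dist=1.0):
    """Group per-camera trajectories; each cluster = one person."""
    n = len(trajectories)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = trajectory_distance(
                trajectories[i], trajectories[j])
    # Agglomerative clustering, cut at a distance threshold.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=max_dist, criterion="distance")
    return labels

# Two cameras observing the same person, plus one other person.
rng = np.random.default_rng(0)
t_cam1 = np.cumsum(rng.normal(scale=0.01, size=(50, 2)), axis=0)
t_cam2 = t_cam1 + rng.normal(scale=0.02, size=(50, 2))  # same person, noisy view
t_other = t_cam1 + 5.0                                   # someone far away
print(cluster_trajectories([t_cam1, t_cam2, t_other]))   # e.g. [1 1 2]
```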

16 pages, 1297 KiB  
Article
Skeleton-Based Attention Mask for Pedestrian Attribute Recognition Network
by Sorn Sooksatra and Sitapa Rujikietgumjorn
J. Imaging 2021, 7(12), 264; https://doi.org/10.3390/jimaging7120264 - 4 Dec 2021
Cited by 1 | Viewed by 2347
Abstract
This paper presents an extended model for a pedestrian attribute recognition network that utilizes skeleton data as a soft attention model to extract local features corresponding to specific attributes. This technique helps retain valuable information surrounding the target area and handles variations in human posture. The attention masks were designed to focus on both partial and whole-body regions. The research utilized an augmentation layer for data augmentation inside the network to reduce over-fitting errors. The network was evaluated on two datasets (RAP and PETA) with various backbone networks (ResNet-50, Inception V3, and Inception-ResNet V2). The experimental results show that the network improves overall classification performance, gaining about 2–3% in mean accuracy over the same backbone network alone, especially for local attributes and varied human postures.
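
As a rough illustration of the skeleton-as-soft-attention idea described above (a sketch under our own assumptions, not the authors' implementation), skeleton joints can be rendered as Gaussian heatmaps and used to reweight CNN feature maps:

```python
# Hypothetical sketch: skeleton joints -> Gaussian heatmap -> soft attention.
import torch

def skeleton_attention_mask(joints, height, width, sigma=4.0):
    """Render joint coordinates (N, 2), given as (x, y) in pixel space,
    as a single soft attention map of shape (height, width) in [0, 1]."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    mask = torch.zeros(height, width)
    for x, y in joints:
        g = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        mask = torch.maximum(mask, g)  # union of per-joint Gaussians
    return mask

# Apply the mask to a (C, H, W) feature map, keeping a residual path so
# that context around the target area is attenuated rather than discarded.
features = torch.randn(64, 32, 16)                  # C, H, W
joints = torch.tensor([[4.0, 8.0], [12.0, 20.0]])   # (x, y) joint positions
mask = skeleton_attention_mask(joints, 32, 16)
attended = features * (1.0 + mask)                  # soft, not hard, masking
```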

Review

27 pages, 4503 KiB  
Review
A Comprehensive Review on Temporal-Action Proposal Generation
by Sorn Sooksatra and Sitapa Watcharapinchai
J. Imaging 2022, 8(8), 207; https://doi.org/10.3390/jimaging8080207 - 23 Jul 2022
Cited by 1 | Viewed by 1967
Abstract
Temporal-action proposal generation (TAPG) is a well-known pre-processing step for temporal-action localization and largely determines localization performance on untrimmed videos. In recent years, there has been growing interest in proposal generation, with researchers focusing on anchor- and boundary-based methods for generating action proposals. The main purpose of this paper is to provide a comprehensive review of temporal-action proposal generation, covering network architectures and empirical results. The pre-processing of input data for network construction is also discussed. The content of this paper was drawn from the research literature on temporal-action proposal generation from 2012 to 2022 for performance evaluation and comparison. From several well-known databases, we used specific keywords to select 71 related studies according to their contributions and evaluation criteria. The contributions and methodologies are summarized and analyzed in tabular form for each category. The results of state-of-the-art research were further analyzed to show the limitations and challenges of action proposal generation. TAPG performance, measured as average recall, ranges from 60% up to 78% on two TAPG benchmarks. In addition, several potential future research directions in this field are suggested based on the current limitations of the related studies.
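
To make the boundary-based family of methods concrete, the following is a minimal, hypothetical sketch (our own illustration, not drawn from any surveyed paper): per-frame start and end probabilities are thresholded, and every valid (start, end) pair becomes a scored candidate proposal, which non-maximum suppression would then typically prune.

```python
# Hypothetical sketch of boundary-based temporal-action proposal generation.
import numpy as np

def generate_proposals(start_prob, end_prob, thresh=0.5, max_len=100):
    """Pair high-probability start/end frames into scored proposals.
    start_prob, end_prob: per-frame boundary probabilities, shape (T,)."""
    starts = np.where(start_prob > thresh)[0]
    ends = np.where(end_prob > thresh)[0]
    proposals = []
    for s in starts:
        for e in ends:
            if s < e <= s + max_len:
                score = start_prob[s] * end_prob[e]
                proposals.append((s, e, score))
    # Highest-confidence proposals first (NMS would usually follow).
    return sorted(proposals, key=lambda p: -p[2])

T = 200
rng = np.random.default_rng(0)
start_prob = rng.random(T) * 0.3            # background noise
end_prob = rng.random(T) * 0.3
start_prob[40], end_prob[90] = 0.9, 0.95    # one strong action instance
print(generate_proposals(start_prob, end_prob)[:3])  # [(40, 90, 0.855)]
```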

19 pages, 704 KiB  
Review
A Structured and Methodological Review on Vision-Based Hand Gesture Recognition System
by Fahmid Al Farid, Noramiza Hashim, Junaidi Abdullah, Md Roman Bhuiyan, Wan Noor Shahida Mohd Isa, Jia Uddin, Mohammad Ahsanul Haque and Mohd Nizam Husen
J. Imaging 2022, 8(6), 153; https://doi.org/10.3390/jimaging8060153 - 26 May 2022
Cited by 38 | Viewed by 7752
Abstract
Researchers have recently focused their attention on vision-based hand gesture recognition. However, due to several constraints, achieving an effective vision-driven hand gesture recognition system in real time has remained a challenge. This paper aims to uncover the limitations faced in image acquisition through the use of cameras, and in the image segmentation and tracking, feature extraction, and gesture classification stages of vision-driven hand gesture recognition under various camera orientations. The paper surveys research on vision-based hand gesture recognition systems from 2012 to 2022, with the goal of identifying areas that are improving and those that need further work. We used specific keywords to find 108 articles in well-known online databases. In this article, we assemble the most notable research works related to gesture recognition and suggest categories, with subcategories, for gesture-recognition-related research to create a valuable resource in this domain. We summarize and analyze the methodologies in tabular form, and after comparing similar types of methodologies in the gesture recognition field, we draw conclusions based on our findings. Our research also examined how well vision-based systems recognize hand gestures in terms of recognition accuracy. There is wide variation in recognition accuracy, from 68% to 97%, with an average of 86.6%. The limitations considered comprise multiple interpretations of gestures and the complex, non-rigid characteristics of the hand. In comparison to current research, this paper is unique in that it discusses all types of gesture recognition techniques.
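
As an illustration of the classical pipeline such reviews cover (segmentation, then feature extraction, then classification), here is a minimal OpenCV sketch of the skin-color segmentation stage. The HSV thresholds and all names are our illustrative assumptions and would need tuning per camera and lighting setup; this is not a method from the review.

```python
# Hypothetical skin-color hand segmentation sketch (requires opencv-python).
import cv2
import numpy as np

def segment_hand(frame_bgr):
    """Return a binary skin mask and the largest skin-colored contour,
    a common first stage before feature extraction and classification."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Rough skin-tone range in HSV; illustrative values only.
    mask = cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))
    # Remove small speckles with morphological opening.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    hand = max(contours, key=cv2.contourArea) if contours else None
    return mask, hand

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # stand-in for a camera frame
mask, hand = segment_hand(frame)
if hand is not None:
    x, y, w, h = cv2.boundingRect(hand)  # crop fed to the classifier stage
```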