**About the Editor**

#### **Hyo Jong Lee**

Hyo Jong Lee received his B.S., M.S., and Ph.D. degrees in computer science from the University of Utah, USA, where he was involved in computer graphics, image processing, and parallel processing. He is currently a professor in the Division of Computer Science, Jeonbuk National University, Jeonju, South Korea. He is also the president of AI Tech Co., Ltd. and a member of the IEEE Computer Society and the Association for Computing Machinery. He has co-authored over 120 papers. His research interests include image processing, artificial intelligence, medical imaging, parallel algorithms, and brain science.

### *Editorial* **Special Issue on Deep Learning-Based Action Recognition**

**Hyo Jong Lee**

Division of Computer Science and Engineering, CAIIT, Jeonbuk National University, Jeonju 54896, Korea; hlee@jbnu.ac.kr

#### **1. Introduction**

Human action recognition (HAR) has gained popularity because of its various applications, such as human–object interaction [1], intelligent surveillance [2], virtual reality [3], and autonomous driving [4]. The demand for HAR applications, as well as for gesture and pose estimation, is growing rapidly, and various HAR methods have been introduced in response. Features can be extracted from images or videos with descriptors such as the local binary pattern, scale-invariant feature transform, histogram of oriented gradients, and histogram of optical flow, and then used to identify action types. Recently, deep learning networks have been deployed in many challenging areas, such as image classification and object detection, and action recognition is an ideal area for their application. One of the primary advantages of deep learning is its ability to automatically learn representative features from large-scale data. As long as sufficient data are available, action recognition coupled with a deep learning network can perform more efficiently than traditional image processing methods.
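As a concrete illustration of such hand-crafted descriptors, the sketch below computes a single gradient-orientation histogram in the spirit of HOG. This is a deliberately simplified, assumed form: real HOG implementations divide the image into cells and blocks with local normalization, and this is not the exact pipeline of any cited paper.

```python
import numpy as np

def hog_descriptor(gray, bins=9):
    """A minimal HOG-style descriptor: one gradient-orientation histogram
    over the whole image (real HOG adds cells, blocks, and block-wise
    normalization)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-8)       # L2-normalize

# A vertical step edge concentrates gradient energy in one orientation bin.
frame = np.zeros((32, 32))
frame[:, 16:] = 1.0
desc = hog_descriptor(frame)
print(desc.shape)  # (9,)
```

In a classical pipeline, such descriptors computed over frames or patches would be fed to a conventional classifier (e.g., an SVM); deep networks replace this hand-designed stage with learned features.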

#### **2. Scope of Action Recognition**

Based on the above understanding, the research results on deep learning-based HAR collected in this issue are interpreted. However, given the challenging nature of HAR, further research from various perspectives is still needed.

Recognition of a subject's posture must precede action recognition. Pose estimation is usually based on a skeleton model consisting of joint points and their connections; a specific action can then be predicted by estimating a person's pose from the joint and skeletal information.
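The skeleton model described above can be sketched as a small data structure. The joint names, coordinates, and connections below are purely illustrative and are not taken from any particular dataset or paper.

```python
import math

# A minimal skeleton model: joints as named 2D points plus bone connections.
# Joint names and coordinates are illustrative, not from any dataset.
JOINTS = {
    "head": (0.0, 1.8), "neck": (0.0, 1.6),
    "l_shoulder": (-0.2, 1.55), "r_shoulder": (0.2, 1.55),
    "l_elbow": (-0.35, 1.3), "r_elbow": (0.35, 1.3),
    "l_wrist": (-0.4, 1.05), "r_wrist": (0.4, 1.05),
    "pelvis": (0.0, 1.0),
}
BONES = [
    ("head", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("neck", "pelvis"),
]

def bone_lengths(joints, bones):
    """Euclidean length of each connection -- a simple pose-derived feature."""
    return {(a, b): math.dist(joints[a], joints[b]) for a, b in bones}

lengths = bone_lengths(JOINTS, BONES)
```

Features derived from such a graph of joints and bones (lengths, angles, offsets) are typical inputs to the skeleton-based recognition methods discussed below.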

The backbone network for action recognition may be either a regular convolutional neural network (CNN) or a graph CNN. Unlike pose estimation at a fixed time point, action recognition can be made more effective by adding temporal information to the spatial information of an object. In some cases the subject of action recognition is a single person, but when multiple people appear in the same scene, the actions of all of them must be processed. Including temporal information about an object's movement is of great help in recognizing specific actions, because it captures the minute movements that cumulatively constitute those actions; the same technique extends to scenes with multiple people. If static action recognition is supplied with sufficient temporal data, it becomes possible to analyze actions captured in videos.
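A minimal sketch of combining spatial and temporal information might look as follows. Both choices are assumptions for illustration only: the per-frame mean intensity stands in for CNN feature maps, and simple frame differencing supplies the temporal channel.

```python
import numpy as np

def spatiotemporal_features(frames):
    """Pair a per-frame spatial feature with its inter-frame change.
    The spatial feature here (mean intensity) is a toy stand-in for CNN
    feature maps; np.diff supplies the temporal information."""
    frames = np.asarray(frames, dtype=float)
    spatial = frames.reshape(len(frames), -1).mean(axis=1)   # (T,)
    temporal = np.diff(spatial, prepend=spatial[0])          # (T,)
    return np.stack([spatial, temporal], axis=1)             # (T, 2)

# Synthetic "video": brightness ramps up over five 8x8 frames.
video = [np.full((8, 8), t / 10.0) for t in range(5)]
feats = spatiotemporal_features(video)
```

In a real pipeline the resulting spatio-temporal features would be fed to a classifier; here the point is only that each time step carries both a spatial value and a temporal delta.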

Gestures convey intentions through various local movements of the arms or fingers in a confined space with a limited range of motion. Gesture recognition can therefore serve as an important component of action recognition, and this special issue also includes research papers focused on gesture recognition.

#### **3. Deep Learning-Based Action Recognition**

Many researchers are interested in deep learning-based action recognition and are actively conducting research in this area. Approximately 25 papers were submitted to this special issue, and 12 of them were accepted (i.e., a 34.2% acceptance rate). This special issue mainly covers training data, pose estimation of objects, action recognition, and gesture recognition. Rey et al. [5] present an approach to solving the data shortage problem in deep learning by extracting synthesized acceleration and gyroscope norm data from video for human activity recognition scenarios.

**Citation:** Lee, H.J. Special Issue on Deep Learning-Based Action Recognition. *Appl. Sci.* **2022**, *12*, 7834. https://doi.org/10.3390/app12157834

Received: 1 August 2022; Accepted: 3 August 2022; Published: 4 August 2022

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There are two papers focused on pose estimation. The first, by S. Kim and H. Lee, introduces the Lightweight Stacked Hourglass Network [6], which expands the convolutional receptive field while reducing the computational load and providing scale invariance. The second, by J. Wu and H. Lee [7], proposes a Partition Pose Representation, which integrates a person instance and their body joints on the basis of joint offsets. They also propose a Partitioned Center Pose Network, which can detect people and their body joints simultaneously and then group all body joints.

Four papers deal directly with action recognition using convolutional networks. The first, authored by Dong et al. [8], introduces high-order spatial and temporal features of skeleton data, such as velocity, acceleration, and relative distance, to construct graph convolutional networks. The other three papers adapt the spatio-temporal concept to extract better features. Tasnim et al. [9] suggest a spatio-temporal image formation technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination. J. Kim and J. Cho [10] propose a low-cost embedded model that extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. Low complexity is achieved by transforming the weighted spatial feature maps into spatio-temporal features and feeding them into multilayer perceptrons. K. Hu et al. [11] propose an improved long short-term memory (LSTM) network that is able to extract temporal information; they enhanced the input differential feature module and the spatial memory state differential module to strengthen action features. A. Stergiou et al. [12] introduce the concept of class regularization, which regularizes feature-map activations based on the classes of the examples used; the method essentially amplifies or suppresses activations according to an educated guess of the given class.
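The spatio-temporal image formation idea can be sketched roughly as follows, under the assumption that joints are laid out along one image axis, frames along the other, and (x, y, z) coordinates fill the color channels. This is not the exact encoding of Tasnim et al. [9], whose method involves additional normalization and color-mapping steps.

```python
import numpy as np

def skeleton_to_image(sequence):
    """Encode a skeleton sequence as an image: rows = joints, columns =
    frames, channels = (x, y, z) coordinates. A crude sketch of
    spatio-temporal image formation; published methods add normalization
    and careful color mapping."""
    seq = np.asarray(sequence, dtype=float)      # (T, J, 3)
    img = np.transpose(seq, (1, 0, 2))           # (J, T, 3)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)         # scale into [0, 1]

T, J = 30, 15                                    # frame and joint counts (illustrative)
seq = np.random.default_rng(0).normal(size=(T, J, 3))
img = skeleton_to_image(seq)
```

The payoff of such an encoding is that an ordinary 2D image classifier can then be applied: spatial structure appears along one axis and temporal evolution along the other.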

There are four papers focused on gesture recognition. A gesture generally consists of a series of continuous actions, so past actions must be memorized; each of the four papers proposes a distinct method. N. Nguyen et al. [13] present a dynamic gesture recognition approach using multiple features extracted from RGB frames and 3D skeleton joint information. N. Do et al. [14] exploit depth and skeletal data for the dynamic hand gesture recognition problem and explore a multi-level feature LSTM with a pyramid block and an LSTM block to handle the diversity of hand features. Y. Chu et al. [15] present a neural network for sensor-based hand gesture recognition that extends PairNet. N. Nguyen et al. [16] present another dynamic hand gesture recognition approach with two modules, gesture spotting and gesture classification, which use a bidirectional LSTM and a single LSTM, respectively.
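A gesture-spotting stage of the kind mentioned above can be caricatured without any network at all: the sketch below marks contiguous runs of above-threshold motion energy as candidate gesture segments, to be labeled by a downstream classifier. The threshold and minimum length are illustrative hyperparameters, and this rule-based stand-in replaces the bidirectional LSTM used in the actual paper [16].

```python
import numpy as np

def spot_gestures(motion_energy, threshold=0.5, min_len=3):
    """Gesture spotting sketch: contiguous runs of above-threshold motion
    energy become candidate segments [start, end). A classifier would
    then label each segment; threshold and min_len are illustrative."""
    active = np.asarray(motion_energy) > threshold
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                             # run begins
        elif not a and start is not None:
            if t - start >= min_len:              # keep runs long enough
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))     # run reaches sequence end
    return segments

energy = [0.1, 0.2, 0.9, 0.8, 0.7, 0.9, 0.1, 0.1, 0.6, 0.7, 0.8, 0.2]
print(spot_gestures(energy))  # [(2, 6), (8, 11)]
```

Splitting spotting from classification in this way lets the classifier see only plausible gesture windows instead of the full continuous stream.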

#### **4. Future Action Recognition**

Traditionally, action recognition has been performed directly on videos or images in a single-layered manner: spatio-temporal features are extracted as 2D feature descriptors, and the action classes are rather simple, such as walking, jumping, or raising a hand. However, as computing power improves and deep learning techniques are naturally applied to action recognition, many researchers are optimistic about its potential. New data on human actions accumulate every day, learning techniques keep improving, and the need for applications related to action recognition is rapidly increasing. Therefore, recognition is now attempted by extracting 3D feature values for each intrinsic action, and various modifications of deep networks are being explored to reduce computational complexity. Ultimately, a deep learning method that can recognize complex actions occurring in the real world is expected to be developed in the future.

**Funding:** This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (GR2019R1D1A3A03103736), and in part by the Project for Joint Demand Technology R&D of Regional SMEs, funded by the Korean Ministry of SMEs and Startups in 2021 (S3035805).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**

