*Article* **Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition**

**Jinsoo Kim and Jeongho Cho \***

Department of Electrical Engineering, Soonchunhyang University, Asan 31538, Korea; js.kim@sch.ac.kr **\*** Correspondence: jcho@sch.ac.kr; Tel.: +82-41-530-4960

**Abstract:** Research on video data is challenging because not only spatial but also temporal features must be extracted, and human action recognition (HAR) is a representative field that applies convolutional neural networks (CNNs) to video data. Although action recognition performance has improved, model complexity still limits real-time operation. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. The spatial feature maps are weighted-averaged by the frame change rate, transformed into spatiotemporal features, and input into a multilayer perceptron, which has lower complexity than other HAR models; thus, our method is well suited to a single embedded system connected to a CCTV camera. Evaluation of action recognition accuracy and data processing speed on the challenging action recognition benchmark UCF-101 showed higher accuracy than an HAR model using long short-term memory with a small number of video frames and confirmed real-time operability through fast data processing. In addition, the performance of the proposed weighted mean-based HAR model was verified by testing it on a Jetson NANO to confirm its applicability to low-cost GPU-based embedded systems.

**Keywords:** CNN; human action recognition; spatiotemporal feature; embedded system; real-time

#### **1. Introduction**

Human action recognition (HAR) in video is one of the most challenging tasks in computer vision, as it requires the simultaneous consideration of spatial and temporal representations of motion [1]. Unlike image classification [2] and object detection [3], which utilize spatial representations extracted from an image by a convolutional neural network (CNN), HAR recognizes actions through spatiotemporal features extracted from time-varying motions, as well as the appearance of the person extracted from the video, which is a series of images [4]. Early research [5,6] recognized actions through two-dimensional CNNs, which learn only spatial representations; however, action recognition was difficult owing to limitations in learning temporal features and conditions such as the scale and pose of the human appearance, the similarity of movements, and changes in the camera's point of view [7]. Therefore, CNN-based models that learn motion representations through spatiotemporal features have been proposed [8–10], in which action in video is recognized from spatiotemporal features extracted by identifying the connectivity of movements that change over time. Models that recognize actions by learning time-varying motion patterns based on CNNs have achieved significant performance improvements in HAR [11,12]. The latest CNN-based HAR models learn time-varying motion representations via an extended CNN structure that combines CNNs applied to static images with networks that extract temporal features [13]. Typically, there are 3D CNNs that simultaneously extract

**Citation:** Kim, J.; Cho, J. Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition. *Appl. Sci.* **2021**, *11*, 4940. https://doi.org/ 10.3390/app11114940

Academic Editor: Hyo Jong Lee

Received: 23 April 2021 Accepted: 25 May 2021 Published: 27 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

spatiotemporal features, which represent human appearance and motion through filters; two-stream CNNs, which combine a spatial CNN network and a temporal CNN network; and convolutional recurrent neural networks (CRNNs), which combine CNNs and recurrent neural networks (RNNs) [14].

A 3D CNN applies convolution layers consisting of 3D kernels to a 3D image stack generated by stacking sequential images over time, simultaneously learning the spatial and temporal features of the input data through the 3D kernels [15]. It therefore has the advantage of recognizing actions by directly generating hierarchical representations of spatiotemporal information from the motion encoded in the image sequence, but the disadvantage [16] that 3D CNN-based models are computationally heavy, with high computation and memory costs.
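To illustrate the idea, the following minimal sketch (an illustrative example in plain NumPy, not the implementation of any cited model) applies a single 3D kernel to a stack of frames; the kernel slides over time as well as height and width, so each output value mixes information from several consecutive frames:

```python
import numpy as np

def conv3d_single(stack, kernel):
    """Valid 3D convolution of a frame stack (T, H, W) with one kernel (t, h, w)."""
    T, H, W = stack.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # slide over time
        for j in range(out.shape[1]):      # slide over height
            for k in range(out.shape[2]):  # slide over width
                out[i, j, k] = np.sum(stack[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# 8 grayscale frames of 16x16 pixels, one 3x3x3 spatiotemporal kernel
stack = np.random.rand(8, 16, 16)
kernel = np.random.rand(3, 3, 3)
features = conv3d_single(stack, kernel)
print(features.shape)  # (6, 14, 14): the time axis shrinks as well
```

Because every output position requires a full 3D multiply-accumulate, the cost grows with the temporal extent of the kernel, which is the source of the high computation and memory cost noted above.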

A two-stream CNN recognizes actions through a model that fuses, in a parallel structure, a spatial CNN network, which extracts appearance information (the spatial features of motion) from images, and a temporal CNN network, which extracts temporal features from motion vectors that change over time, such as optical flow [17]. Such parallel structures overcome the limitation that conventional CNNs struggle to learn temporal representations, and they efficiently fuse the spatial features with the temporal features extracted from optical flow through late fusion. However, because two CNN networks are fused in a parallel structure, the data processing time is long compared with single-stream CNNs, which makes it difficult to apply two-stream CNNs to real-time HAR systems [18].

CRNNs are single-stream CNN structures that recognize actions by identifying the context of time-varying motions through RNNs after a CNN extracts spatial feature maps from sequential image stacks [19]. Long short-term memory (LSTM) is used as the network that analyzes the temporal information, and a model that fuses a CNN and an LSTM is called a ConvLSTM. Because the LSTM in a ConvLSTM learns time-varying motion patterns from images, it identifies the connectivity of spatial features that change over time and extracts temporal features well. However, a ConvLSTM has the disadvantage that the more images are extracted from the video, the more the complexity of the model increases.
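As a rough sketch (illustrative only; the cited models use trained convolutional LSTMs, and the random weights here are a stand-in), the recurrence below shows why the cost of a CNN-plus-LSTM pipeline grows with the number of frames: every additional frame requires another full pass through the gate computations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; x is a flattened CNN feature vector for one frame."""
    z = W @ x + U @ h + b          # all four gates in one affine map
    n = h.size
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2*n])          # forget gate
    o = sigmoid(z[2*n:3*n])        # output gate
    g = np.tanh(z[3*n:])           # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
feat_dim, hidden = 32, 16
W = rng.standard_normal((4 * hidden, feat_dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
frames = rng.standard_normal((10, feat_dim))  # 10 per-frame CNN features
for x in frames:                              # cost grows with frame count
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (16,)
```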

The CNN-based HAR models mentioned above can be used in various fields that require practical HAR, as their recognition performance has improved over existing techniques; however, owing to model complexity and long data processing times, there are still limitations to operating them in real time on low-cost embedded systems [20]. Recently, HAR has been applied to integrated monitoring systems, which detect emergency situations with CCTVs. As shown in Figure 1, most integrated monitoring systems receive multiple CCTV streams through network-based digital video communication and extract monitoring information [21]. Owing to the complexity of HAR models, this video transmission process is needed so that actions can be recognized on a server PC equipped with a computationally powerful GPU. Consequently, HAR models are limited in their ability to recognize actions in real time when installed in a single embedded system directly connected to the CCTV camera.

**Figure 1.** An example of a wireless communication network for recognizing behavior through CCTV using a complex HAR model.

In particular, it is essential for the system to recognize human actions quickly in urgent safety-related situations, such as violence and theft on the streets, and in emergencies, such as detecting abnormal behavior of the elderly or of single-person households [22]. When actions are recognized by an HAR model on the embedded system where the camera is installed, the time and cost of transmitting high-capacity video data can be reduced [23].

Therefore, in this paper, we propose a weighted mean-based single-stream CNN model that recognizes actions faster than conventional models. Because the proposed method has a simpler structure than existing high-complexity CNN-based models, it can recognize actions in a single embedded system connected to a CCTV camera without transmitting image data over network communication. The main idea is to build a lightweight CNN-based HAR model that can run on low-cost embedded systems with low-end GPUs, so that HAR can be applied to CCTV-based surveillance and instantaneous motions in emergency situations can be recognized at high speed. The proposed system generates spatiotemporal features by taking a weighted average of the spatial feature maps extracted by a CNN, with weights given by the frame change rate over time. The weighted mean is computed sequentially from the change rate of each frame extracted at an arbitrary time interval from the video; because the frame change rate at each time step weights the averaging of the feature maps, the spatial and temporal features are combined into an integrated representation of time-varying motion. Spatial feature maps at times when the amount of motion change is high are weighted more heavily than those at times when it is low, so the context of the motion representation is created and spatial representations are extracted efficiently at the moments that most affect recognition performance. The weighted mean also allows actions to be recognized at a rapid data processing rate through a lighter structure than conventional CNN-based HAR models: 3D CNNs extract temporal features using spatiotemporal filters, two-stream CNNs utilize optical flow, and ConvLSTMs utilize LSTMs, and such models take a long time to compute because they extend the two-dimensional CNN structure with additional networks that extract temporal features. In contrast, the proposed model recognizes actions by inputting a one-dimensional spatiotemporal feature vector, generated from the spatial feature maps and the frame change rate via the weighted mean, into FC layers. Therefore, video data are processed at high speed while the existing two-dimensional CNN structure is maintained.

#### **2. Related Works**

In the ImageNet Challenge 2012 [24], a deep learning-based CNN with performance superior to the algorithms previously used in computer vision was proposed, and recent studies show that deep learning-based CNNs have a high level of usability for HAR. HAR models based on CNNs automatically extract and learn motion features from video and utilize features extracted from different data modalities to enhance action recognition performance; from the perspective of data modality, action recognition models are largely divided into depth-, skeleton-, and vision-based HAR [25].

#### *2.1. Depth-Based Human Action Recognition*

Depth- and skeleton-based HAR [26,27] recognize actions using changes in the motion representation of depth maps acquired through a depth sensor. The depth maps composing an RGB-D video have a spatiotemporal structure, and changes in the depth information over time are extracted as spatiotemporal features of motion [28]. In addition, depth maps clearly separate people from the background when representing appearance information, so they can be used to extract meaningful features for action recognition.

Zhang et al. [29] propose orientation histogram features of 3D normal vectors that extend histogram of oriented gradients (HOG) features extracted from depth maps to a spatiotemporal depth structure and represent the appearance information of that three-dimensional structure. The authors of [30] construct super normal feature vectors based on depth map sequences to represent motion for action recognition. The feature vectors are generated by applying spatial average pooling and temporal max pooling to time-varying depth maps, and evaluation results on various benchmark datasets show robustness to scale changes.

HAR models using depth information have shown high action recognition performance but are limited to a restricted range and specific environments. Commonly used depth sensors include stereo cameras using triangulation, time-of-flight (TOF) cameras, and structured-light cameras. Depth sensors using stereo cameras are inexpensive, but the process of calculating depth information is complicated, making it difficult to acquire accurate depth data. Depth sensors based on TOF and structured-light cameras are heavily influenced by light and are therefore difficult to apply outdoors. In addition, depth sensors represent only the information within their measurable distance, which makes them difficult to apply to outdoor surveillance. Some sensors, such as light detection and ranging (LiDAR), are robust to lighting and measure a relatively wide depth range, but they are expensive and not suitable for video-based surveillance systems.

#### *2.2. Skeleton-Based Human Action Recognition*

Skeleton-based HAR [31,32] recognizes actions through joint points extracted via CNN-based pose estimation algorithms from depth maps or RGB images. The locations of a person's joint points along the time axis, extracted from the video, are used as feature vectors; the points are connected according to the person's body structure, and adjacent points have important correlations with each other.

Warcho et al. [33] proposed a CNN-based HAR model that automatically learns the spatial and temporal features of joint-point data. To reduce data redundancy and preserve spatiotemporal features, key frames are extracted using an interframe difference method, joint points are generated through OpenPose [34], and then a CNN is applied. Recently, more research on skeleton-based HAR than on depth-based HAR has been proposed, but because joint points must first be extracted with a pose estimation algorithm, the accuracy of the joint points varies with sensor performance. The recognition performance of the whole system can be degraded if the sensor yields noisy joint points or if they are affected by external environmental factors such as lighting and occlusion [35].

#### *2.3. Vision-Based Human Action Recognition*

Before deep learning-based CNNs were applied to HAR, conventional methods recognized actions using hand-crafted features for a limited set of human actions performed against simple backgrounds. Hand-crafted features include spatiotemporal interest points (STIPs) [36], 3-dimensional HOG [37], and the 3-Dimensional Scale Invariant Feature Transform (3D-SIFT) [38], which use various feature encoding schemes such as histograms or pyramids. In [39], the geometric properties of the space-time volume traced by human movement are extracted as action sketches, which stack body outlines along the time axis according to direction, speed, and shape. These low-level features were fed into machine learning-based classification algorithms, such as SVMs, decision trees, and K-nearest neighbors, for HAR.

After deep learning-based CNNs were proposed, research on various CNN-based methods for building HAR models has been conducted; typical examples are 3D CNNs [40], two-stream CNNs [41], and ConvLSTMs [42]. Recently, studies that build HAR models by fusing the above methods, as well as methods that feed depth maps and joint points into the algorithms of vision-based HAR models, have also been proposed. Karpathy et al. [43] propose a spatiotemporal LSTM (Spatial LSTM) model for 3-dimensional HAR that extends the RNN into the spatiotemporal domain to analyze the hidden features of motion representations. In addition, in [44], a study was conducted to classify actions from the spatial and temporal encoding information of depth map sequences by applying 3D CNNs to encode motion patterns as spatiotemporal features in the depth map sequences. The advantages and disadvantages of these three HAR modalities are summarized in Table 1.


**Table 1.** Comparison of the advantages and disadvantages of data modalities for action recognition.

#### **3. Proposed Methodology**

*3.1. System Overview*

The proposed embedded system-based HAR model generates spatiotemporal features by combining, through a weighted mean, a spatial feature map extracted using a CNN with a temporal feature extracted via the frame change rate. The generated spatiotemporal features are entered into a multilayer perceptron (MLP), which is lighter than the networks used in existing HAR models, and the MLP outputs the action class; a block diagram of the entire system is shown in Figure 2. The image sequence composing a video is converted into a stack of *N* frames (*FS<sub>N</sub>*) sampled at a random interval, and a CNN extracts a feature map (*FM<sup>N</sup><sub>K</sub>*) with spatial features, where *K* is the length of a feature map flattened into one dimension. Here, the CNN is used as a feature extractor for the spatial feature map, and the fully connected layers at the end of the CNN are removed. Because the sequential frames within a video carry time information in their stacking order, the frame difference (*FD<sub>N</sub>*) between the frames constituting *FS<sub>N</sub>* is extracted as a temporal feature. Finally, to fuse the spatial and temporal features, *FM<sup>N</sup><sub>K</sub>* is averaged with weights *FD<sub>N</sub>* over the interval of *FS<sub>N</sub>* to generate a feature vector (*FV<sub>K</sub>*) with spatiotemporal features, which is input into the MLP. The MLP, which is independent of the CNN, outputs one of the predefined action classes; the process of recognizing an action from the input video data is shown in Figure 3.

**Figure 2.** Block diagram of the proposed weighted mean-based human action recognition (HAR) model.

**Figure 3.** Action recognition process of the proposed system.

#### *3.2. Extract Spatial Features*

Videos are sequential data in which frames representing spatial representations are ordered by time, and the order of the frames is the temporal information required to recognize the context of the action contained in the video. The videos in the UCF-101 dataset are recorded at 25 fps. If every frame per unit time were entered into the model, the processing time would increase with the computational cost, and much of the data would be redundant. Considering the processing complexity, the proposed system extracts *N* frames at a random interval and converts the image sequence of the input video into *FS<sub>N</sub>*.
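A minimal sketch of this sampling step (illustrative; the synthetic video and the choice of N are our assumptions, not the authors' code) picks N evenly spaced indices starting from a random offset and gathers the corresponding frames:

```python
import numpy as np

def sample_frame_stack(video, n):
    """Extract FS_N: n frames taken at a regular interval with a random start."""
    total = video.shape[0]
    interval = total // n                      # spacing between sampled frames
    start = np.random.randint(0, total - interval * (n - 1))
    idx = start + interval * np.arange(n)
    return video[idx]

# Synthetic stand-in for a 25 fps clip: 100 frames of 224x224 RGB
video = np.zeros((100, 224, 224, 3), dtype=np.uint8)
fs = sample_frame_stack(video, n=20)
print(fs.shape)  # (20, 224, 224, 3)
```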

*FS<sub>N</sub>* is input into the CNN structure *VGGNet* [45], which is pretrained on the ImageNet dataset, and feature maps are extracted. The resolution of the images in *FS<sub>N</sub>* is converted to (224, 224, 3) according to the input size of *VGGNet*. *VGGNet* is a CNN structure with deeply stacked convolutional layers using (3, 3) filters, which has fewer parameters than shallowly stacked convolutional layers with larger filters. Therefore, although *VGGNet* has a deep layer structure, it has been shown to extract feature maps from input data at high speed and to extract significant features whose classification error decreases as the layers deepen. In addition, the feature map that *VGGNet* generates in the convolutional layer before the FC layers has size (7, 7, 512), which is smaller than those of other CNN structures such as *ResNet* and *Inception*. The proposed system is kept lightweight by using *VGGNet*, which extracts relatively small feature maps, to build an HAR model that processes data at high speed. *MobileNet*, a CNN structure designed for embedded systems with relatively low-performance GPUs and memory, was also considered. However, its final feature map has size (7, 7, 1024); because it outputs a larger feature map than *VGGNet*, it was not suitable for the purpose of this paper, which requires faster data processing.

#### *3.3. Integrated Spatial and Temporal Features Based on Weighted Mean*

*FS<sub>N</sub>* is a collection of image data containing time information, sequentially ordered at an arbitrary interval, and the model extracts *FD<sub>N</sub>*, the frame change rate of the images in *FS<sub>N</sub>*, as a temporal feature. *N* denotes the number of frames constituting *FS<sub>N</sub>*; because the frames are extracted at equal time intervals, the frame change rate *FD<sub>N</sub>* can be used as a temporal feature of the motion representation. *FD<sub>N</sub>* is calculated as the average frame change computed sequentially over the image sequence of *FS<sub>N</sub>* and is transformed into a one-dimensional temporal feature vector. Compared with the optical flow applied in two-stream CNNs and the LSTM applied in single-stream CNNs, it is simpler to compute and is extracted at high speed. Because the frame change rate calculated from the differences between the images constituting *FS<sub>N</sub>* has length *N* − 1, the frame change between the first frame of the video and the first frame of *FS<sub>N</sub>* is additionally calculated. Therefore, *FD<sub>N</sub>* has the same length *N* as the data, representing the time information of *FS<sub>N</sub>* extracted at an arbitrary interval from the video, and the *N*th element of *FD<sub>N</sub>* is the frame change rate between *FM<sup>N−1</sup><sub>K</sub>* and *FM<sup>N</sup><sub>K</sub>*. *FM<sup>N</sup><sub>K</sub>* has size (*N*, *K*) over the entire time because a one-dimensional vector of length *K* is extracted at each time step. The data in each row of *FM<sup>N</sup><sub>K</sub>* are the spatial features extracted from the image at a specific time, and the rows are sorted along the column direction over time.
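The following sketch is one interpretation of this description (not the authors' code; the use of a mean absolute pixel difference as the "frame change" is our assumption). It produces an FD of length N by prepending the change between the video's first frame and the first sampled frame:

```python
import numpy as np

def frame_change_rate(first_video_frame, fs):
    """FD_N: mean absolute pixel change per sampled frame (length N)."""
    frames = fs.astype(np.float64)
    # Differences between consecutive sampled frames: length N - 1
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    # Extra difference against the video's first frame restores length N
    first = np.abs(frames[0] - first_video_frame.astype(np.float64)).mean()
    return np.concatenate(([first], diffs))

# Toy example: 4 sampled 8x8 RGB frames with steadily increasing brightness
fs = np.stack([np.full((8, 8, 3), 10.0 * i) for i in range(4)])
fd = frame_change_rate(np.zeros((8, 8, 3)), fs)
print(fd)  # [ 0. 10. 10. 10.]
```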
In the proposed system, the spatiotemporal feature vector *FV<sub>K</sub>* is generated by the weighted mean of *FD<sub>N</sub>*, which represents the degree of change of each frame over time, with the spatial features *FM<sup>N</sup><sub>K</sub>* sorted by time information, as shown in Equation (1).

$$FV\_K = \left[ \frac{\sum\_{i=1}^{N} FM\_1^i \times FD^i}{\sum\_{i=1}^{N} FD^i} \;\; \frac{\sum\_{i=1}^{N} FM\_2^i \times FD^i}{\sum\_{i=1}^{N} FD^i} \;\; \cdots \;\; \frac{\sum\_{i=1}^{N} FM\_{K-1}^i \times FD^i}{\sum\_{i=1}^{N} FD^i} \;\; \frac{\sum\_{i=1}^{N} FM\_K^i \times FD^i}{\sum\_{i=1}^{N} FD^i} \right] \tag{1}$$

The weighted mean converts the spatial feature map *FM<sup>N</sup><sub>K</sub>*, listed over time, into a spatiotemporal feature over the entire time span; each of the *K* elements of the feature maps is averaged with the value of *FD<sub>N</sub>* at the corresponding time as its weight. *FD<sub>N</sub>* is a temporal feature extracted from spatial information, i.e., the motion of objects and the background information that changes with the camera's point of view, so it can be matched with the spatial features at the moment *FM<sup>N</sup><sub>K</sub>* is extracted. According to Equation (1), the system divides the sum of the products of the *K*th element of each feature map and *FD<sub>N</sub>* at the corresponding time by the sum of *FD<sub>N</sub>* to generate the spatiotemporal feature *FV*. In this way, the elements of the feature maps become spatiotemporal features for action recognition through a weighted mean that accounts for the frame change rate over the entire time span. Figure 4 schematizes the operation for *K* = 1; *FV<sub>K</sub>* is generated by repeating the operation *K* times according to the size of the spatial feature map. The weighted mean emphasizes the moments in the video when the motion or background information of an object changes greatly according to the time information in *FD<sub>N</sub>*, thereby capturing the context of the spatial features as they change over the whole time span. Finally, *FV<sub>K</sub>* generated through the weighted mean is input into the FC layers of the MLP, and the MLP outputs one of the predefined video classes.
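Equation (1) can be sketched in a few lines (illustrative; the variable names are ours) as a single weighted average along the time axis of the feature-map matrix:

```python
import numpy as np

def weighted_mean_features(fm, fd):
    """Equation (1): FV_K from FM with shape (N, K) and FD with shape (N,)."""
    # Each column k of FM is averaged over time with FD as the weights.
    return (fm * fd[:, None]).sum(axis=0) / fd.sum()

N, K = 20, 25088                      # 20 frames, flattened (7, 7, 512) maps
fm = np.random.rand(N, K)             # spatial feature maps FM_K^N
fd = np.random.rand(N) + 0.1          # frame change rates FD_N (positive)
fv = weighted_mean_features(fm, fd)
print(fv.shape)  # (25088,)
```

The whole fusion step is thus one broadcasted multiply and two reductions, which is why it is so much cheaper than an optical-flow network or an LSTM.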

**Figure 4.** Process for generating weighted mean-based spatiotemporal features.

#### *3.4. Action Recognition Using Multilayer Perceptron*

Finally, the proposed HAR model recognizes an action by inputting the spatiotemporal feature vector *FV<sub>K</sub>*, generated by the weighted mean, into the MLP. The MLP receiving *FV<sub>K</sub>* has three hidden layers; the input layer consists of 25,088 nodes, and the output layer consists of 101 nodes, the number of predefined classes. The hidden layers consist of 2048, 1024, and 512 nodes, respectively. The batch size is 64 and, to prevent overfitting during training, dropout with a rate of 0.5 is applied after the first hidden layer. Categorical cross-entropy was selected as the loss function for multi-class classification according to the purpose of the system, and adaptive moment estimation (ADAM) was applied as the optimizer.
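A NumPy sketch of this classifier's forward pass with the stated layer sizes (random weights for illustration; the paper trains the model in Keras, ReLU activations are our assumption, and dropout is active only during training, so it is omitted here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [25088, 2048, 1024, 512, 101]          # input, 3 hidden layers, output
weights = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def mlp_forward(fv):
    h = fv
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                    # hidden layers
    return softmax(h @ weights[-1] + biases[-1])  # 101-way class probabilities

probs = mlp_forward(rng.random(25088))         # FV_K-sized input vector
print(probs.shape)  # (101,)
```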

#### **4. Experimental Results**

#### *4.1. Experimental Setup*

The performance evaluation results of the proposed weighted mean-based action recognition model are described below. The model was trained and tested on the HAR benchmark dataset UCF-101.

UCF-101 consists of 13,320 video clips collected from YouTube covering 101 action classes. The action categories are divided into (1) human–object interaction, (2) body motion only, (3) human–human interaction, (4) playing musical instruments, and (5) sports. It is a relatively challenging dataset, filmed under various illuminations, human poses, and camera viewpoints. According to the train–test list provided with the dataset, 9.5K clips are assigned to training and 3.8K clips to testing, and 20% of the training data are used for validation. In addition, the train–test list consists of three scenarios with randomly ordered video data, so the average of the three performance evaluations was selected as the final result.

The proposed system was trained on a workstation, and performance evaluations were conducted on the workstation and an NVIDIA Jetson NANO. On the workstation, tests were conducted to compare the accuracy of the proposed method with existing HAR models, and on the Jetson NANO, tests were performed to evaluate usability based on the throughput of a low-cost embedded system. The model was built in Python with Keras. The workstation consists of an Intel i9-10900X CPU, an NVIDIA Titan RTX (24 GB) GPU, and 128 GB of main memory. The Jetson NANO, shown in Figure 5, consists of a quad-core ARM A57 CPU, a 128-core Maxwell GPU, and 4 GB of main memory.

**Figure 5.** The proposed Jetson NANO-based HAR system appearance.

#### *4.2. Performance Evaluation*

The main idea proposed in this paper is to recognize actions at a rapid rate through the spatiotemporal feature vector (*FV<sub>K</sub>*) generated by the weighted mean of the spatial feature maps (*FM<sup>N</sup><sub>K</sub>*), extracted from the frame stack (*FS<sub>N</sub>*) of sequential images, with the change rate (*FD<sub>N</sub>*) of each frame representing the time information. The performance evaluation compares the action recognition accuracy and complexity of the proposed weighted mean-based HAR model with those of existing HAR models. The comparison results are shown in Table 2; these results were obtained on the workstation. The models used for comparison received an RGB stream exactly as in the system built in this paper, and the model complexity and average accuracy were compared over the three random scenarios of the train–test split list provided with UCF-101. Equation (2) expresses the complexity of each model, and each factor is defined in Table 3. If the network is a 3D CNN, a time-axis operation is added to the convolution operation. The recognition accuracies of the 3D CNN and the single-stream CNNs were 84.8%, 85.2%, and 88.1%, respectively, and the recognition accuracies of the LSTM-based models were 90.8% and 91.21%. The LSTM-based models show higher accuracy than the 3D CNN and single-stream CNNs because they receive sequential data and weight it to identify connectivity. However, the proposed single-stream CNN-based model recognized actions with 2.48% higher accuracy than deep LSTM. This demonstrates that the proposed method, which uses the change rate of sequential frames as time information, can effectively identify the connectivity of motion for action recognition.

$$\mathrm{N\_{CNN}} = \sum\_{conv} 2\mathrm{C\_{k-1}C\_{k}N\_{k}^{(kernel)}}N\_{k} + \sum\_{pool}\mathrm{C\_{k}N\_{k}}\left(N\_{k}^{(pool)} - 1\right) + 2(\mathrm{C\_{k}N\_{k}N\_{d}} + N\_{d}N\_{out})$$

$$\begin{aligned} \mathrm{N\_{LSTM}} &= \mathrm{L} \cdot \sum\_{k=1}^{K} (8(N\_{k-1} + N\_{k})N\_{k} + 4N\_{k}) + 2N\_{K}N\_{out} \\ \mathrm{N\_{BiLSTM}} &= 2\mathrm{L} \cdot \sum\_{k=1}^{K} (8(N\_{k-1} + N\_{k})N\_{k} + 4N\_{k}) + 2N\_{K}N\_{out} \end{aligned} \tag{2}$$
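The LSTM operation count in Equation (2) can be evaluated directly (a small helper of our own; the layer sizes below are arbitrary examples, not values from the paper):

```python
def lstm_ops(layer_sizes, n_out, seq_len):
    """Operation count N_LSTM from Equation (2).

    layer_sizes: [input_dim, hidden_1, ..., hidden_K]
    n_out:       number of output classes
    seq_len:     sequence length L (number of time steps)
    """
    total = 0
    for prev, cur in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += 8 * (prev + cur) * cur + 4 * cur  # gate multiply-adds per step
    return seq_len * total + 2 * layer_sizes[-1] * n_out

# One hidden layer of 8 units fed by 4 inputs, 3 classes, 10 time steps
print(lstm_ops([4, 8], n_out=3, seq_len=10))  # 8048
```

Because the recurrent term is multiplied by the sequence length L, the cost of the LSTM-based models grows linearly with the number of frames extracted from the video, unlike the proposed weighted mean.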


**Table 2.** Comparison of average accuracy of the proposed method for UCF-101 with other methods.

**Table 3.** Notation of equations indicating model complexity.


In addition, further experiments were conducted to verify the applicability of the action recognition model on embedded systems equipped with low-cost GPUs. The experiments compared "Element-wise Mean," "Weighted Mean," "ConvLSTM," and "Bi-ConvLSTM." Here, "Element-wise Mean" denotes an operation that averages the *FM<sup>N</sup><sub>K</sub>* output by the CNN over the *N* frames without weighting by time information.

The frame number (*N*) of *FS<sub>N</sub>*, extracted at a random interval, was increased in steps of five from 10 to 40, and the action recognition accuracy and testing time were measured for each *N*; the testing time was measured as the average time taken to output the action class for each video in the test dataset. The action recognition accuracy and testing time of the methods according to *N* are shown in Figure 6; these results were obtained on the Jetson NANO.

**Figure 6.** Accuracy and testing time according to frame number of *FS<sup>N</sup>* extracted at a random interval: (**a**) accuracy and (**b**) testing time.

First, the comparative evaluation was conducted with the case of applying "Elementwise Mean" to determine whether the weighted mean effectively extracts meaningful spatiotemporal features to recognize action by identifying connectivity of time-varying spatial features. Performance evaluation results show that the proposed method for weighted *FDN*, in all *N* measures higher accuracy than "Element-wise Mean" and the weighted mean generates spatiotemporal features to recognize action efficiently. When the spatial and temporal features are fused through unweighted "Element-wise Mean" of *FDN*, the spatiotemporal features are generated by relying solely on the number of *N*, which is the interval of *FSN*. Since videos are sequential data with time-varying frames, even in the case of considering only the interval of *FSN*, the connectivity of the timevarying spatial features can be identified. The motion and background information of an object have nonlinear features rather than constant changes such as the interval of *FSN*, so if frames depend only on extracted intervals, changes in various environmental conditions cannot be considered. Various environmental conditions include changes in camera's viewpoints and influences of external environmental factors such as lighting and obstacles, according to data acquisition environment, and accordingly, the scale and pose of people's appearances change and motion representation changes over time. Therefore, if spatiotemporal features are generated through "Element-wise Mean," the spatiotemporal features cannot be efficiently generated because spatial representation changes according to the above conditions cannot be identified at a specific time. 
In contrast, the weighted mean considers the change rate of frames over the extracted interval: large changes in spatial representation at a specific time are weighted with the time information when the spatiotemporal features are generated, so action recognition performance improves by effectively accounting for the motion and background information of objects changing over time.

In addition, comparing the weighted mean with the LSTM-based models "ConvLSTM" and "Bi-ConvLSTM", when more than 30 frames are extracted, the LSTM-based models predict action with recognition accuracy higher by 2–3% or more. However, when 20–30 frames are extracted, the proposed weighted mean-based model predicts action with recognition accuracy higher by 20% or more. In particular, when 20 frames are extracted, the proposed weighted mean-based model and the LSTM-based models show the greatest performance difference; the action prediction accuracy and testing time of each method when *N* = 20 are shown in Table 4, where the tests were conducted on Jetson NANO.

The LSTM prediction network was proposed to solve the problem of RNNs, which sequentially receive and process time-series data, losing the influence of early inputs as the input sequence grows longer. It controls the amount of information retained in the hidden state through input gates and forget gates. When a large number of frames is extracted from a video, the amount of data increases and the interval between extracted frames shortens, which is advantageous for identifying time-varying spatial features. Therefore, when more than 30 frames are extracted, the LSTM-based models can recognize action with higher accuracy than the weighted mean-based model. In addition, "Bi-ConvLSTM" also utilizes the information obtained when the sequentially listed *FM<sub>K</sub><sup>N</sup>* is input in the reverse direction, so its accuracy exceeds that of "ConvLSTM" as the number of extracted frames (*N*) increases. This indicates that the LSTM-based models perform better as more frames are extracted from the videos.
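The sequential gating just described can be made concrete with a minimal NumPy LSTM cell. This is a generic textbook sketch under assumed shapes, not the ConvLSTM used in the experiments: the point it illustrates is that the hidden state must be updated once per frame, so the computation grows with *N*.

```python
import numpy as np

def lstm_forward(x_seq, W, U, b):
    """Run a minimal LSTM over a sequence of feature vectors (sketch).

    x_seq : (N, D) spatial feature vectors, one per extracted frame
    W     : (4*H, D) input weights; U : (4*H, H) recurrent weights
    b     : (4*H,) bias; gates stacked as [input, forget, output, candidate]
    """
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in x_seq:                      # one step per frame: cost grows with N
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])               # input gate: admit new information
        f = sigmoid(z[H:2 * H])          # forget gate: discard old memory
        o = sigmoid(z[2 * H:3 * H])      # output gate: expose cell state
        g = np.tanh(z[3 * H:])           # candidate cell update
        c = f * c + i * g                # gated memory update
        h = o * np.tanh(c)
    return h                             # final hidden state summarizes the clip
```

A bidirectional variant such as "Bi-ConvLSTM" runs this loop a second time over the reversed sequence and combines the two final states, roughly doubling the per-clip cost.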

**Table 4.** Comparison of action prediction accuracy using four methods, when *N* = 20.



However, comparing the testing times, the proposed weighted mean-based HAR model showed faster data processing than the LSTM-based models when more than 30 frames were extracted. This is because the proposed model recognizes action using an MLP, which is lighter than LSTM, as the prediction network. LSTM receives time-series data sequentially over time and controls the amount of stored information while repeating the hidden-layer operation for the length of the data. Therefore, when many frames are extracted from a video, LSTM must accept a relatively long data input and its processing speed slows; the testing times of "ConvLSTM" and "Bi-ConvLSTM" in Figure 6b confirm that processing becomes slower as *N* increases. In contrast, the MLP predicts action faster than LSTM because its forward propagation is performed regardless of the data length: whereas LSTM receives data sequentially to identify the temporal features of the input spatial feature maps and repeats the hidden-layer computation *N* times, the proposed model receives the spatiotemporal feature generated by the weighted mean of *FM<sub>K</sub><sup>N</sup>* with *FD<sup>N</sup>*, which represents the time information, at once as data for the entire time. Consequently, even Jetson NANO, with its low-cost GPU, can recognize action faster than with LSTM. In addition, the weighted mean-based model recognizes action with an accuracy higher by 30% or more even when as few as ~10 frames are extracted, fewer than the LSTM-based models require. This shows that the proposed method, which generates spatiotemporal features through the weighted mean, efficiently extracts motion representations for action recognition; therefore, the HAR model can be constructed by extracting relatively few frames, and instantaneous actions can be recognized quickly.
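The single forward pass described above can be sketched as follows. The layer sizes and function name are illustrative assumptions, not the paper's architecture; the point is that, unlike the LSTM loop, the cost of this pass does not depend on how many frames *N* were extracted, because the weighted mean has already collapsed the clip into one feature vector.

```python
import numpy as np

def mlp_predict(feature, layers):
    """Classify a fused spatiotemporal feature with an MLP (sketch).

    feature : (D,) weighted-mean spatiotemporal feature for the whole clip
    layers  : list of (W, b) pairs; ReLU hidden layers, softmax output.
    A single forward pass regardless of the number of extracted frames N.
    """
    h = feature
    for W, b in layers[:-1]:
        h = np.maximum(0.0, W @ h + b)       # ReLU hidden layer
    W, b = layers[-1]
    logits = W @ h + b
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()                       # probabilities over action classes
```

For a benchmark such as UCF-101 the output layer would have 101 units; the predicted action class is simply the arg-max of the returned probability vector.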

#### **5. Conclusions**

In this paper, we proposed a weighted mean-based spatiotemporal feature extraction technique for building a CNN-based HAR model that recognizes action by processing video data at high speed. Previously, action was recognized by analyzing motion patterns based on time information through models with complex structures, such as 3D CNNs, two-stream CNNs using optical flow, and ConvLSTMs. The proposed method instead recognizes action by using frame changes, which can be computed at high speed, as the temporal feature. This temporal feature weights the spatial features extracted by the CNN, so the weighted mean emphasizes the spatial information at the points where the motion changes significantly and generates spatiotemporal features covering the entire time. The generated spatiotemporal features are fed into an MLP with lower complexity than the prediction models used in existing HAR, and the proposed model was verified via experiments to recognize action while processing data faster than existing CNN-based HAR models. In addition, the efficiency of using the frame change rate as time information for extracting spatiotemporal features was verified through higher action recognition performance compared with the general averaging technique. Performance evaluation according to the number of frames extracted from videos showed that action was recognized with high accuracy even when fewer frames were extracted than required by the HAR model using LSTM. Finally, to assess real-time feasibility in embedded systems with low-cost GPUs, performance evaluation on Jetson NANO showed that data are processed at high speed, confirming the system's value for recognizing instantaneous actions in emergency situations.

Nevertheless, the proposed model still has a weakness: because it recognizes action using changes in image frames over time, it is vulnerable to rapid changes of the background or obstacles. Frame changes caused by the background or obstacles, rather than by the actions of the person to be recognized, act as noise and degrade action recognition performance. In particular, it is very difficult to recognize action with a constantly moving camera, rather than a fixed camera such as the CCTV we adopted, because the frames change continuously. Addressing this issue is the direction of our future research.

**Author Contributions:** All authors took part in the discussion of the work described in this paper. Conceptualization, J.K.; Data curation, J.K.; Investigation, J.K.; Project administration, J.C.; Writing original draft, J.C.; Writing—review & editing, J.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MOE) (No.2018R1D1A3B07041729) and the Soonchunhyang University Research Fund.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Acknowledgments:** The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

**Conflicts of Interest:** The authors declare that they have no competing interests.

#### **References**

