*1.2. Related Works*

Assigning a single action label to a multi-person video clip dilutes the specificity of the information and makes it less meaningful. For many real-world applications, such as video-based assessment of human behavior, there is a need for person-centric action recognition, which assigns an action label to each person in a multi-person video clip. One of the challenges in person-centric action recognition is robust tracking of the target in long videos. Tracking is difficult because there are many sources of uncertainty, such as clutter, occlusions, target interactions, and camera motion. However, most research on human activity classification has dealt with videos containing a single human actor or with clips for which ground-truth tracking is provided [18], with the exception of a few works that perform human-centric action recognition [8,19]. Girdhar et al. [19] re-purposed an action transformer network to exclude non-target human actors in the scene and aggregated spatio-temporal features around the target human actor. Chen et al. [8] presented human activity classification using skeleton motion in videos with interference from non-target objects, aimed at supporting applications in monitoring frail and elderly individuals. However, neither work provided details on how non-target filtering was addressed in their human action classification pipelines.

Besides handling non-target objects, which is essential for a well-performing real-world human action recognition system, creating robust and discriminative feature representations for each video action clip plays an important role in detecting different human activities [20]. Most state-of-the-art action recognition architectures process appearance and motion cues in two independent streams of information, which are fused right before the classification stage or a few stages earlier in a merge-and-divide scheme [21,22]. Others have used 3D spatio-temporal convolutions to directly extract relevant spatial and temporal features [23–25]. However, human pose cues, which can provide low-dimensional interpretations of different activities, have been overlooked in these studies. Most recently, Choutas et al. [26] and Mengyuan et al. [27] used temporal changes in pose information, with two different representations, to boost action recognition performance. In [27], the authors claim that when multiple people are present in the scene, their pose motion representation does not require temporal association of the joints, but they do not address how the proposed method handles multiple human actors in a video.
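To make the architectural distinction concrete, the following minimal PyTorch sketch (our illustration, not taken from the cited works; backbone sizes and module names are hypothetical placeholders) contrasts a two-stream network with late fusion of appearance and motion features against a single-stream model built on 3D spatio-temporal convolutions.

```python
# Illustrative sketch only: contrasts late fusion of appearance/motion streams
# with a single 3D spatio-temporal convolutional stream. Backbones are kept
# deliberately tiny; real systems use deep CNNs in place of each stream.
import torch
import torch.nn as nn

class TwoStreamLateFusion(nn.Module):
    """Appearance (RGB) and motion (optical-flow) streams fused just before the classifier."""
    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        self.rgb_stream = nn.Sequential(        # operates on a single RGB frame
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.flow_stream = nn.Sequential(       # operates on stacked optical-flow (x, y) fields
            nn.Conv2d(2, feat_dim, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, flow):
        fused = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        return self.classifier(fused)

class SpatioTemporal3D(nn.Module):
    """Single stream that extracts joint space-time features with 3D convolutions."""
    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                    # clip: (batch, 3, frames, height, width)
        return self.classifier(self.backbone(clip))
```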

In general, convolutional neural network (CNN) based action recognition approaches can be divided into three categories based on their underlying architecture: (1) spatio-temporal (3-dimensional) convolutions, (2) recurrent neural networks, and (3) two-stream convolutional networks. The benefit of multi-stream networks is that different modalities can be aggregated within the network to improve the performance of the final action classification task. In this paper, we address the problem of person-centric action recognition by long-term tracking of the target human actor. In addition, our method feeds a novel pose evolution representation of the target human actor, rather than the spatio-temporal features commonly extracted from raw video frames, to the classification network. It is worth mentioning that our pose-based action recognition stream can be used to augment current multi-stream action classification networks, for example by late fusion of per-stream class scores, as sketched below.
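As a hypothetical illustration of such augmentation (not necessarily the fusion scheme used in this paper), per-stream class scores can be combined by weighted late fusion; the stream names and weights below are placeholders.

```python
# Hypothetical late fusion of per-stream class scores; stream weights are
# placeholders to be tuned on a validation set.
import torch

def fuse_stream_scores(rgb_logits: torch.Tensor,
                       flow_logits: torch.Tensor,
                       pose_logits: torch.Tensor,
                       weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted average of softmax scores from appearance, motion, and pose streams."""
    w_rgb, w_flow, w_pose = weights
    fused = (w_rgb * rgb_logits.softmax(dim=1)
             + w_flow * flow_logits.softmax(dim=1)
             + w_pose * pose_logits.softmax(dim=1)) / sum(weights)
    return fused  # shape: (batch, num_classes)

# Usage: predicted_class = fuse_stream_scores(rgb_out, flow_out, pose_out).argmax(dim=1)
```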

The rest of the paper is organized as follows. In Section 2, we describe the proposed method for tracking the target human actor in untrimmed videos in order to extract appropriate pose evolution features from the actions performed in a video. In Section 3, we describe the subsequent stages for action classification (illustrated in Figure 2), which include the pose evolution feature representation and the classification network. We present our experimental setup and the performance evaluation results of the proposed method in Section 4. Finally, we discuss the results in Section 5 and conclude the paper in Section 6.
