**1. Introduction**

Clinical assessment of human motor behavior plays an important role in the diagnosis and management of medical conditions like Parkinson's Disease (PD) [1]. However, such assessments can only be performed intermittently by trained clinical examiners, which limits the quantity and quality of information that can be collected to understand the impact of disease in the real-world setting. To address these limitations, significant efforts have been made to develop wearable sensing technologies that can be used for continuously monitoring various types of motor symptoms and behaviors [2–4]. While data collected using wearable sensors are well suited for detecting and measuring basic movements (e.g., arm or leg movements, tremor) and actions (e.g., sitting, standing, walking), they are poorly suited to recognizing complex activities (e.g., cooking, grooming) and behaviors (e.g., personal habits, routines), particularly when these involve the interpretation of environmental interactions (e.g., with other humans, animals, or objects). Capturing these factors matters because understanding what influences physical behavior can help clinicians better understand the impact of motor and non-motor symptoms on the daily life of patients with PD [5].

Recently, artificial intelligence (AI) assisted classification of human behavior using computer vision has received renewed attention among researchers in the machine learning and pattern recognition communities, for applications ranging from automatic recognition of daily life activities in smart homes to monitoring the health and safety of elderly people and patients with mobility disorders in their homes or in hospitals [6–12]. Although vision-based approaches pose a greater risk to the privacy and security of an individual than wearable devices do [13], they make it possible to automate the detection and measurement of the full range of human behaviors.

As illustrated in Figure 1, the taxonomy of human behaviors can be viewed as a four-level hierarchical framework with basic movements at the bottom (e.g., movement of body segments) and complex behaviors (e.g., personal habits and routines) at the top. Automatic recognition at any level requires that actions and/or behaviors at the level below it are also recognized. For example, in order to recognize walking, we first need to assess whether the pose is upright, the arms are swinging, and the legs are moving. At the first level (motion), recognition deals with tasks such as movement detection or background extraction/segmentation in video recordings of the target [14–16]. These techniques try to locate the moving objects in a scene by extracting a silhouette of the object in a single frame or over a few consecutive frames. However, segmentation algorithms without any further processing provide only very basic pose estimation of the object, with little to no temporal information. At the second level (action), human movements along with environmental interactions are classified in order to recognize what the target is doing over a period of seconds or minutes [17]. At the third level (activity), the recognition task is focused on identifying activities as combinations of sequences of actions and environmental interactions over a period of minutes to hours. Finally, at the fourth level (behavior), sequences of activities and environmental interactions, along with information about their temporal dependencies, are used to recognize complex human behaviors.

**Figure 1.** Taxonomy of human behaviors with different levels of semantics and complexity. Recognition of each level requires most of the underlying tasks to be recognized [6].
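To make the first (motion) level of the taxonomy more concrete, the following is a minimal sketch of silhouette extraction by frame differencing. It is only an illustration of the general idea and is not taken from the cited methods [14–16]; the function names `silhouette_mask` and `bounding_box`, the threshold value, and the toy frames are all assumptions introduced here.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # grayscale image as rows of 0-255 intensities (assumed format)


def silhouette_mask(prev: Frame, curr: Frame, threshold: int = 30) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than `threshold` between two frames."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) of the marked pixels, or None if none are marked."""
    coords = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 4x5 frames (hypothetical data): a bright 2x2 blob appears in the second frame.
frame_a = [[10] * 5 for _ in range(4)]
frame_b = [row[:] for row in frame_a]
for y in (1, 2):
    for x in (2, 3):
        frame_b[y][x] = 200

mask = silhouette_mask(frame_a, frame_b)
box = bounding_box(mask)  # region that changed between the two frames
```

As the text notes, a mask of this kind only localizes the changed region; it carries no identity, pose, or longer-term temporal information, which is why the higher levels of the taxonomy need further processing.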

#### *1.1. Our Contributions*

Automated assessment of human behavior in multi-person video (i.e., when several people are present in the video) requires the tracking and classification of a sequence of actions performed by a target (e.g., a patient). Therefore, accurate temporal tracking of the target is an essential requirement for this application, along with robust feature extraction that can be used for classifying human behaviors at different levels of complexity. In this paper, we present a hierarchical target-specific action classification method, which is illustrated as a block diagram in Figure 2. Actions performed by the target are detected using a pose evolution feature representation. We define pose evolution as a low-dimensional embedding of the sequence of posture movements required to perform an action (e.g., walking). To obtain the pose evolution feature representation corresponding to the target, we present a cascaded target pose tracking algorithm that receives multi-person pose estimation results from an earlier stage and tracks the target pose throughout the video. Our main contributions in this paper are: (1) a robust hierarchical multiple-target pose tracking method that facilitates action recognition in videos recorded in uncontrolled environments in the presence of multiple human actors; (2) pose evolution, an explicit body movement representation that complements appearance and motion cues for robust action recognition; and (3) a novel target-specific action classification architecture applied to untrimmed video recordings of patients with PD.
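As a rough illustration of what a compact pose evolution representation could look like, the sketch below collapses a sequence of keypoint sets into a single small map in which later frames are painted with larger weights. This is an assumption made for illustration only: this section does not specify the encoding actually used, and the function name `pose_evolution_map`, the grid size, and the linear time weighting are hypothetical choices.

```python
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]  # (x, y), each normalised to [0, 1] (assumed convention)


def pose_evolution_map(poses: Sequence[Sequence[Keypoint]], size: int = 8) -> List[List[float]]:
    """Collapse a pose sequence into one size x size map.

    Each keypoint is painted into its grid cell with a weight that grows
    linearly with time (1/T for the first frame, 1.0 for the last one).
    A cell keeps the largest weight painted into it.
    """
    grid = [[0.0] * size for _ in range(size)]
    n_frames = len(poses)
    for t, pose in enumerate(poses):
        weight = (t + 1) / n_frames
        for x, y in pose:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row][col] = max(grid[row][col], weight)
    return grid


# Hypothetical clip: a single keypoint (e.g., a wrist) moving left to right over 4 frames.
clip = [[(0.10, 0.5)], [(0.35, 0.5)], [(0.60, 0.5)], [(0.85, 0.5)]]
evolution = pose_evolution_map(clip)
```

The resulting map is image-like, so in principle it could be passed to a CNN classifier; the direction and order of motion are encoded in the increasing weights along the trajectory.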

**Figure 2.** Overview of the proposed multi-stage method for human behavior phenotyping in untrimmed videos. At the first stage, human detection and pose estimation are applied to the recorded video. At the second stage, the regressed bounding boxes for each detected person and corresponding keypoints are used for tracking the identities in the video. Tracking is done in an incremental process incorporating both appearance and time information. Outputs of tracking the target identity along with ground-truth time segmentation are used for generating a compact representation of the target actor pose evolution in time for each action clip. Finally, the augmented pose evolution representation is fed to a convolutional neural network (CNN)-based action classification network to recognize actions of interest.
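The tracking stage in Figure 2 follows one identity through per-frame detections. The sketch below shows a deliberately simplified stand-in that uses only bounding-box overlap (intersection over union, IoU) between consecutive frames; the method summarized in the caption also incorporates appearance and time information, which this sketch omits. The names `iou` and `track_target`, the `min_iou` threshold, and the toy detections are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_target(
    frames: Sequence[Sequence[Box]], start_index: int, min_iou: float = 0.3
) -> List[Optional[int]]:
    """Return, per frame, the index of the detection matched to the target.

    The target is the detection `start_index` in the first frame. In each
    later frame the box with the highest IoU against the last matched box is
    chosen; None is recorded when no box reaches `min_iou`.
    """
    if not frames:
        return []
    current = frames[0][start_index]
    track: List[Optional[int]] = [start_index]
    for detections in frames[1:]:
        best_idx, best_score = None, min_iou
        for i, box in enumerate(detections):
            score = iou(current, box)
            if score >= best_score:
                best_idx, best_score = i, score
        track.append(best_idx)
        if best_idx is not None:
            current = detections[best_idx]  # keep the last known box otherwise
    return track


# Hypothetical detections: two people; the detector's ordering changes and
# the target is missed entirely in the third frame.
detections_per_frame = [
    [(0, 0, 10, 10), (50, 0, 60, 10)],
    [(51, 0, 61, 10), (1, 0, 11, 10)],
    [(52, 0, 62, 10)],
    [(2, 0, 12, 10), (53, 0, 63, 10)],
]
target_track = track_target(detections_per_frame, start_index=0)
```

Overlap alone is fragile when people cross paths or the target is occluded for long periods, which is one reason a tracker intended for uncontrolled multi-person recordings would draw on additional cues such as appearance.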
