3.3.2. Object Trajectory

Once detected in 2D, the object is tracked in 3D using the ICP algorithm. The resulting successive transformations are modeled as a trajectory in *SE*(3). This trajectory is then mapped to the Lie algebra *se*(3) and fused with the trajectory in *se*(3) × ... × *se*(3) (*n* − 1 times) generated by the body parts. The trajectories modeling the activity thus lie in *se*(3) × ... × *se*(3) (*n* times) and have to be compared independently of the execution rate. They are therefore treated as time-parameterized curves, and an elastic metric is used to provide a re-parameterization-invariant distance. The additional trajectory (generated by the object) is used only when comparing the proposed approach to RGB-D-based approaches; in this case, the output of the last layer of the object detection network applied to the first frame is also used (when the color channel is considered) to build the final feature vector.
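To make this step concrete, the sketch below (an illustration, not the authors' code) maps a sequence of ICP-estimated 4 × 4 rigid transforms to *se*(3) coordinates using the matrix logarithm; the helpers `se3_log` and `object_trajectory` are hypothetical names.

```python
import numpy as np
from scipy.linalg import logm

def se3_log(T: np.ndarray) -> np.ndarray:
    """Map a 4x4 transform in SE(3) to a 6-vector in se(3)
    (3 rotation coordinates + 3 translation coordinates)."""
    L = np.real(logm(T))                           # 4x4 matrix in the Lie algebra
    omega = np.array([L[2, 1], L[0, 2], L[1, 0]])  # skew-symmetric rotation part
    v = L[:3, 3]                                   # translation part of the log
    return np.concatenate([omega, v])

def object_trajectory(transforms) -> np.ndarray:
    """Stack the se(3) coordinates of the successive ICP transforms
    into an object trajectory of shape (T, 6)."""
    return np.vstack([se3_log(T) for T in transforms])
```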

### **4. Rate Invariance Modeling and Classification**

We start by outlining a mathematical framework for analyzing the temporal evolution of human activity, viewed as trajectories in the shape space of parameterized curves. This framework respects the underlying geometry of the shape of the trajectories, seen as curves, and provides the desired invariances, in particular invariance to re-parameterization of the trajectory curve, which encodes the execution rate. The next step is to compute the distance between a given trajectory (to classify) and all training trajectories; with $k$ trajectories in the training set, this yields a $k$-dimensional feature vector.

### *4.1. Elastic Metric for Trajectories*

This representation has been used previously in biometric and soft-biometric applications [38–43]. In our case, we analyze the shape of a trajectory $\sigma : I \to \mathbb{R}^n$ through its square root velocity function (SRVF) $q : I \to \mathbb{R}^n$, defined as:

$$q(t) = \frac{\dot{\sigma}(t)}{\sqrt{\|\dot{\sigma}(t)\|}}\tag{6}$$

$q(t)$ is a special function introduced in [35] that captures the shape of $\sigma(t)$ while simplifying computations, and the $\mathbb{L}^2$ norm provides the metric that allows us to compare the shapes of two trajectories. The set of all trajectories, represented by their unit-norm SRVFs and denoted $\mathcal{C}$, is thus defined as follows:

$$\mathcal{C} = \{q : I \to \mathbb{R}^n \mid \|q\| = 1\} \subset \mathbb{L}^2(I, \mathbb{R}^n) \tag{7}$$
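As an illustration, the following hypothetical Python helper computes a discrete SRVF for a sampled trajectory and rescales it to unit $\mathbb{L}^2$ norm so that it lies in $\mathcal{C}$; the finite-difference velocity and the regularizing `eps` are implementation choices, not part of the paper.

```python
import numpy as np

def srvf(sigma: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """SRVF of Equation (6) for a trajectory sampled on I = [0, 1].

    sigma: array of shape (T, n). Returns q rescaled to unit L2 norm
    so that it lies in the hypersphere C of Equation (7).
    """
    T = sigma.shape[0]
    sigma_dot = np.gradient(sigma, 1.0 / (T - 1), axis=0)  # finite-difference velocity
    speed = np.linalg.norm(sigma_dot, axis=1)
    q = sigma_dot / np.sqrt(speed + eps)[:, None]          # Equation (6)
    q_norm = np.sqrt(np.mean(np.sum(q * q, axis=1)))       # Riemann-sum L2 norm on [0, 1]
    return q / (q_norm + eps)
```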


Since every element of $\mathcal{C}$ has unit $\mathbb{L}^2$ norm, $\mathcal{C}$ is a hypersphere in $\mathbb{L}^2(I, \mathbb{R}^n)$, and the geodesic between two elements $q_1, q_2 \in \mathcal{C}$ is given by the arc of the great circle connecting them:

$$\psi(\tau) = \frac{1}{\sin(\theta)} \left( \sin((1-\tau)\theta)\, q_1 + \sin(\tau\theta)\, q_2 \right) \tag{8}$$

The geodesic length is $\theta = d_{\mathcal{C}}(q_1, q_2) = \cos^{-1}(\langle q_1, q_2 \rangle)$. Let us define the equivalence class of $q$ as $[q] = \{\sqrt{\dot{\gamma}(t)}\, q(\gamma(t)),\ \gamma \in \Gamma\}$, where $\Gamma$ is the set of orientation-preserving re-parameterizations of $I$. The set of such equivalence classes, denoted by $\mathcal{S} \doteq \{[q] \mid q \in \mathcal{C}\}$, is called the shape space of open curves in $\mathbb{R}^n$. As shown in [35], $\mathcal{S}$ inherits a Riemannian metric from the larger space $\mathcal{C}$ due to the quotient structure. To obtain geodesics and geodesic distances between elements of $\mathcal{S}$, one needs to solve the optimization problem:

$$
\gamma^* = \arg\min_{\gamma \in \Gamma} d_{\mathcal{C}}\left(q_1, \sqrt{\dot{\gamma}}\,(q_2 \circ \gamma)\right).\tag{9}
$$

The optimization over $\Gamma$ is performed using a dynamic programming algorithm. Let $q_2^*(t) = \sqrt{\dot{\gamma}^*(t)}\, q_2(\gamma^*(t))$ be the optimal element of $[q_2]$, associated with the optimal re-parameterization $\gamma^*$ of the second curve; then, the geodesic distance between $[q_1]$ and $[q_2]$ in $\mathcal{S}$ is $d_{\mathcal{S}}([q_1], [q_2]) \doteq d_{\mathcal{C}}(q_1, q_2^*)$, and the geodesic is given by Equation (8), with $q_2$ replaced by $q_2^*$.
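The sketch below illustrates one simple way to approximate this optimization: a dynamic program over piecewise-linear warping paths on a grid, restricted to a small set of slopes. It is a simplified stand-in for the full procedure of [35], and all helper names (`inner`, `segment_cost`, `align_srvf`, `elastic_distance`) are hypothetical.

```python
import numpy as np

# Candidate slopes for the piecewise-linear warping path.
STEPS = [(1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2)]

def inner(a, b):
    """Discrete L2 inner product on I = [0, 1] (uniform grid, (T, n) arrays)."""
    return float(np.mean(np.sum(a * b, axis=1)))

def segment_cost(q1, q2, i0, j0, i1, j1):
    """Cost of matching q1 on grid segment [i0, i1] against q2 warped
    linearly from j0 to j1; the slope plays the role of gamma_dot."""
    T = len(q1)
    slope = (j1 - j0) / (i1 - i0)
    cost = 0.0
    for i in range(i0, i1):
        frac = (i - i0) / (i1 - i0)
        j = j0 + frac * (j1 - j0)                 # gamma sampled on q2's grid
        jl = int(j)
        w = j - jl
        q2g = (1 - w) * q2[jl] + w * q2[min(jl + 1, T - 1)]
        diff = q1[i] - np.sqrt(slope) * q2g
        cost += float(np.dot(diff, diff)) / T
    return cost

def align_srvf(q1, q2):
    """Approximate gamma* of Equation (9) by dynamic programming and return
    q2* = sqrt(gamma_dot*) * (q2 o gamma*) sampled on q1's grid."""
    T = len(q1)
    E = np.full((T, T), np.inf)
    E[0, 0] = 0.0
    parent = {}
    for i in range(1, T):
        for j in range(1, T):
            for di, dj in STEPS:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or not np.isfinite(E[pi, pj]):
                    continue
                c = E[pi, pj] + segment_cost(q1, q2, pi, pj, i, j)
                if c < E[i, j]:
                    E[i, j] = c
                    parent[(i, j)] = (pi, pj)
    node = (T - 1, T - 1)                         # backtrack the optimal path
    path = [node]
    while node != (0, 0):
        node = parent[node]
        path.append(node)
    path.reverse()
    gamma = np.interp(np.arange(T), [p[0] for p in path], [p[1] for p in path])
    gamma_dot = np.maximum(np.gradient(gamma), 1e-8)
    q2_warp = np.array([q2[min(int(round(g)), T - 1)] for g in gamma])
    return q2_warp * np.sqrt(gamma_dot)[:, None]

def elastic_distance(q1, q2):
    """Rate-invariant distance d_S([q1], [q2]) = cos^-1(<q1, q2*>)."""
    q2s = align_srvf(q1, q2)
    q2s = q2s / np.sqrt(inner(q2s, q2s))          # re-project onto C
    return float(np.arccos(np.clip(inner(q1, q2s), -1.0, 1.0)))
```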

### *4.2. Feature Vector Building and Classification*

We propose four variants of our method based on the channels used. The first (geometric, G) uses only the skeleton data. The second (G + D) uses the skeleton and the depth channel. The third (G + C) uses the skeleton and the color channel. The last uses all channels (G + D + C).

We first present the feature vector for the geometric (G) approach. Let $n$ be the number of joints in a skeleton, modeled spatio-temporally as presented in Section 3, and $k$ the number of trajectories in the training set, with labels $l_1, \ldots, l_k$. For a given sequence in the test set, the first step is to represent it as a trajectory in $SE(3)^{n-1}$, as described in Section 3. Then, the elastic framework is applied to compute the elastic distance from the given trajectory to each of the $k$ training ones. As shown in the previous section, the elastic metric ensures a rate-invariant comparison of trajectories. The resulting vector of $k$ distances is the feature vector of the geometric approach (G). An additional trajectory in $SE(3)$ is considered when the depth data are used in the (G + D) approach; the feature vector has the same size, but the trajectories lie in $SE(3)^n$ rather than $SE(3)^{n-1}$. When the color channel is considered, the output of the last layer of the deep network used for object detection is concatenated with the $k$ distances to build the final feature vector, denoted *FeatureV*. The steps of feature vector building and classification are summarized in Algorithm 1.

The resulting feature vector is fed to the Hoeffding tree (VFDT) algorithm. The Hoeffding tree [36], or very fast decision tree (VFDT), is built incrementally over time by splitting nodes (into two) using a small portion of the incoming data stream. The number of samples that must be observed before expanding a node is determined by a statistical result known as the Hoeffding bound, or additive Chernoff bound. The tree is constructed by recursively splitting leaves, starting from a blank root, so that leaves become internal decision nodes and a tree structure is formed. The splits are decided by heuristic evaluation functions that assess the merit of a split-test based on attribute values.
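For illustration, the snippet below trains a Hoeffding tree incrementally on toy feature vectors using the third-party `river` library (an assumption: the paper does not name an implementation; `grace_period` controls how many samples are observed before a split is attempted).

```python
import random
from river import tree  # assumed third-party implementation of VFDT

model = tree.HoeffdingTreeClassifier(grace_period=50)

def to_dict(fv):
    """river expects dict-valued samples."""
    return {f"d{i}": float(v) for i, v in enumerate(fv)}

# Toy stream: 4 "elastic distances" per sample, two action labels.
for _ in range(500):
    label = random.choice(["drink", "walk"])
    fv = [random.gauss(0.0 if label == "drink" else 1.0, 0.3) for _ in range(4)]
    model.learn_one(to_dict(fv), label)

print(model.predict_one(to_dict([0.1, -0.2, 0.05, 0.0])))  # most likely "drink"
```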

**Algorithm 1** Action sequences' classification.

1: **Input:** $k$ training sequences with trajectories $\sigma^{(1)}, \ldots, \sigma^{(k)}$ and labels $l_1, \ldots, l_k$; a test sequence $S_{k+1}$ with trajectory $\sigma^{(k+1)}$

2: **Output:** label($S_{k+1}$)

3: **Begin**

4: **for** $i = 1$ to $k + 1$ **do**

5: map $\sigma^{(i)}$ from $SE(3)^n$ to the Lie algebra $se(3)^n$ (Section 3)

6: **end for**

7: $q_{k+1} = \dot{\sigma}^{(k+1)} / \sqrt{\|\dot{\sigma}^{(k+1)}\|}$ (SRVF of the test trajectory, Equation (6))

8: **for** $i = 1$ to $k$ **do**

9: $q_i = \dot{\sigma}^{(i)} / \sqrt{\|\dot{\sigma}^{(i)}\|}$

10: $\gamma_i^* = \arg\min_{\gamma \in \Gamma} d_{\mathcal{C}}(q_{k+1}, \sqrt{\dot{\gamma}}\,(q_i \circ \gamma))$ (dynamic programming, Equation (9))

11: $q_i^* = \sqrt{\dot{\gamma}_i^*}\,(q_i \circ \gamma_i^*)$

12: FeatureV($i$) $= \cos^{-1}(\langle q_i^*, q_{k+1} \rangle)$

13: **end for**

14: concatenate the output of the object detection network with FeatureV when the color channel is used

15: label($S_{k+1}$) = VFDT(FeatureV)

16: **End**
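Putting the pieces together, the following outline mirrors Algorithm 1 at a high level, reusing the hypothetical `srvf` and `elastic_distance` helpers sketched in Section 4.1 and the trained VFDT `model` above; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def build_feature_vector(test_traj, train_trajs, cnn_features=None):
    """FeatureV of Algorithm 1: the k elastic distances, optionally
    concatenated with the object detector's last-layer output
    (the G + C and G + D + C variants)."""
    q_test = srvf(test_traj)  # hypothetical helper from Section 4.1
    dists = [elastic_distance(q_test, srvf(tr)) for tr in train_trajs]
    if cnn_features is not None:  # color channel used
        dists = list(dists) + list(cnn_features)
    return np.asarray(dists, dtype=float)

def classify(test_traj, train_trajs, model, cnn_features=None):
    """Predict the action label with the incremental VFDT model."""
    fv = build_feature_vector(test_traj, train_trajs, cnn_features)
    return model.predict_one({f"d{i}": float(v) for i, v in enumerate(fv)})
```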

## **5. Experimentation and Results**

In order to validate our method, an evaluation was conducted on three datasets presenting different challenges, namely the Microsoft Research (MSR) Action3D dataset [8], the MSR Daily Activity 3D dataset [23], and the SYSU 3D Human-Object Interaction Set [44].

### *5.1. MSR Action 3D*

### 5.1.1. Data Description and Protocol

The MSR-Action 3D dataset is a set of RGB-D data captured with a Kinect. It includes 20 actions performed by 10 different subjects facing the camera. Each action is performed two or three times, resulting in a total of 557 action sequences. The 3D joint positions are extracted from the depth sequences using the real-time skeleton tracking algorithm proposed in [45]. All actions are performed without interaction with objects. Two main challenges are identified: the strong similarity between different groups of actions and the variations in execution speed. For each sequence, the dataset provides depth, color, and skeleton information. As indicated in [8], ten sequences are not used in the experiments because their skeletons are missing or too erroneous; our experiments therefore use 547 sequences. We followed the same cross-subject protocol as [8], in which half of the subjects are used for training and the other half for testing: Subjects 1, 3, 5, 7, and 9 for training and Subjects 2, 4, 6, 8, and 10 for testing. In [8], the sequences were divided into three subsets *AS*1, *AS*2, and *AS*3, each containing eight actions. Sets *AS*1 and *AS*2 group actions with similar movements, while *AS*3 groups complex actions.

### 5.1.2. Experimental Result and Comparison

No action in this dataset involves interaction with an object; thus, only the skeleton-based approach (G) is evaluated. Table 1 reports the recognition performance on MSR-Action3D compared to several state-of-the-art approaches: joint positions (JPs) [14], the concatenation of the 3D coordinates of all joints $v_1, \ldots, v_N$; pairwise relative positions of the joints (RJPs) [16], the concatenation of all the pairwise difference vectors; joint angles (JAs) [5], the concatenation of the quaternions corresponding to all joint angles (we also tried Euler angle and axis-angle representations for the joint angles, but quaternions gave the best results); and individual body part locations (BPLs) [46], in which each body part is represented as a point in $SE(3)$ using its rotation and translation relative to the global x-axis.

In the last row of Table 1, the average recognition rate over the three subsets *AS*1, *AS*2, and *AS*3 is reported. The recognition rates of our approach on *AS*1, *AS*2, and *AS*3 were 94.66%, 85.08%, and 96.76%, respectively. The accuracy on subset *AS*2 was lower than on the other two subsets; this behavior is shared by the state-of-the-art approaches, as revealed in Table 1. The average accuracy of the proposed representation was 92.16%, which is superior to the performance of the previous state-of-the-art approaches listed in Table 1.

Table 2 compares the proposed approach with various skeleton-based human action recognition approaches using the protocol of [8]. Our approach is competitive with the state-of-the-art, with a recognition rate of 92.16%.



**Table 2.** Comparison with the state-of-the-art results.


### *5.2. MSR Daily Activity 3D*

### 5.2.1. Data Description and Protocol

The MSR Daily Activity 3D dataset [23] is a set of RGB-D sequences of human activities acquired with a Kinect. It contains 16 types of activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, and sit down. Each activity was performed twice by each of 10 subjects [23], for a total of 320 videos (16 × 10 × 2: 10 actors and two trials per actor). Twenty body joints are recorded, whose positions are quite noisy, in particular for the two poses "sitting on sofa" and "standing close to sofa". The experimental protocol is the same as in [23], which divides the dataset into three subsets, *AS*1, *AS*2, and *AS*3, as shown in Table 3.


**Table 3.** Subsets of actions, *AS*1, *AS*2, and *AS*3 in the MSR Daily Activity 3D dataset [23].

### 5.2.2. Experimental Result and Comparison

Table 4 reports the results of our algorithm on the MSR Daily Activity dataset. The average recognition rate using only the skeleton data is 87.55%. When the dynamics of the object are also considered, the average recognition rate rises to 88%. Combining the feature vector derived from the geometry of the skeleton and the object with the appearance of the object yields a clear improvement: using the geometry of the skeleton (G) and the appearance of the object (C), the average recognition rate is 94.44%. The performance improves further when the geometry of the object (D) is added, reaching a 95% recognition rate, which is very competitive with recent state-of-the-art approaches.



### *5.3. SYSU 3D Human-Object Interaction Set*

### 5.3.1. Data Description and Protocol

In this dataset [44], twelve different activities focusing on interactions with objects were performed by 40 subjects. In each activity, a participant manipulates one of six different objects: phone, chair, bag, wallet, mop, and besom. In total, 480 video clips were collected. Each clip was acquired with a Kinect camera, providing the corresponding RGB images, depth sequence, and skeleton. We evaluated all compared methods under the second setting (Setting 2) of [44]: the videos of half of the participants are used to train the model parameters, and the rest are used for testing. For each method, we report the average accuracy and the standard deviation over 30 random splits.

### 5.3.2. Experimental Result and Comparison

Table 5 provides the results of the different variants of the proposed approach compared to the state-of-the-art. When only the geometry of the skeleton data is considered, the recognition rate is 73.48%. This result is competitive with previous geometric approaches (based only on skeleton data). Taking the dynamics of the object into account, the recognition rate is 74.51%. When the appearance of the object is considered in addition to the geometry of the skeleton, the recognition rate reaches 86.76%, the best rate among the recent state-of-the-art approaches compared. The full version of the proposed approach, which makes use of all RGB-D and skeleton information, provides a recognition rate of 87.40%.

For further analysis of the obtained results, we show in Figures 4 and 5 the confusion matrices on the SYSU dataset using the skeleton data (G) and the RGB-D (G + D + C) channels, respectively. The skeleton data perform well for several actions, but are not sufficient to distinguish others such as sweeping and mopping, which have recognition rates of only 35% and 54.9%, respectively: the skeleton motion during these two actions is similar to the motion while drinking, moving the chair, or pouring. The appearance of the object improves the performance for all actions, but the improvement is most considerable for the sweeping and mopping actions. As shown in Figure 5, the recognition of sweeping improves by 36% when the geometry and the appearance of the object are used in addition to the skeleton motion, and the recognition rate for mopping reaches 78.3%.


**Figure 4.** SYSU 3D Human-Object Interaction dataset confusion matrix based on skeleton data (G).


**Figure 5.** SYSU 3D Human-Object Interaction dataset confusion matrix based on skeleton, depth, and color data (G + D + C).

**Table 5.** Comparison on the SYSU 3D dataset. (D) depth; (C) color (or RGB); (G) geometry or skeleton.


### **6. Conclusions and Future Direction**

In this paper, we represented the inter-frame evolution of skeleton body parts as trajectories in the Lie group *SE*(3) × ... × *SE*(3). When an object is involved in the action, a neural network detects the object in the first frame; its evolution across frames is then tracked and similarly modeled as an additional trajectory in the Lie group *SE*(3). The resulting trajectories are mapped onto the Lie algebra, where they are compared using a re-parameterization-invariant framework in order to handle rate variations. The distances to the training trajectories are concatenated with the output of the last layer of the neural network used for object detection and fed to a very fast decision tree to perform action recognition. We experimentally showed that the proposed approach outperforms many previous approaches for human activity recognition. As future work, we plan to explore applications in domains such as physical therapy and rehabilitation.

**Author Contributions:** Formal analysis, H.D. and I.R.F.; Methodology, M.B., H.D. and I.R.F.; Supervision, M.M. and I.R.F.; Validation, H.D., M.M. and I.R.F.; Visualization, I.R.F.; Writing–original draft, M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.
