### *3.1. Proposed Approach*

In this work, we propose a framework for human activity recognition that uses a body part-based skeleton representation for action recognition, and object detection and tracking for human–object interaction recognition. Figure 1 summarizes the proposed approach. First, skeleton and object sequences are represented as trajectories in a Lie group. These trajectories are then mapped into the corresponding Lie algebra and subsequently to a Riemannian manifold, where they are compared in a rate-invariant way. In addition to the distances to training trajectories, the output of the last layer of the neural network used for object detection is also used, in some scenarios, to build the final feature vector. Classification is finally performed using the Hoeffding tree, also known as very fast decision trees (VFDT).

**Figure 1.** Overview of the proposed approach. VFDT, very fast decision trees.

Inspired by the work in [2], which focused on the rigid transformations between different body parts within the same frame, we propose to model the evolution of the same body part (a segment) across frames by using the rotation and translation required to transform the segment at frame *t* into the corresponding segment at frame *t* + 1. This rotation and translation of a rigid body in 3D space is an element of the special Euclidean group *SE*(3) [34]. The evolution between two successive frames can therefore be modeled as a point in *SE*(3) × ... × *SE*(3) (*n* − 1 times), where *n* − 1 is the number of body segments for a skeleton with *n* joints. A sequence of *N* frames is therefore represented by *N* − 1 points in *SE*(3) × ... × *SE*(3) (*n* − 1 times) and can be modeled as a trajectory in this manifold. When an object is considered, an existing neural network is used for object detection in the first frame (the RGB image of the object in frame *j* of sequence *i* is denoted by *OBJ* − *RGB*(*i*)(*j*)), and the object is then tracked throughout the sequence (the depth data of the object in frame *j* of sequence *i* is denoted by *OBJ* − *Depth*(*i*)(*j*)) using the iterative closest point (ICP) algorithm. The rigid transformations of the object across frames create an additional trajectory in *SE*(3), which is combined with the trajectory generated by the skeleton motion to yield a final trajectory in *SE*(3) × ... × *SE*(3) (*n* times). The next step is to map this trajectory to the corresponding Lie algebra *se*(3) × ... × *se*(3), which is the tangent space at the identity element. The resulting trajectories lie in a Euclidean space (the Lie algebra) and encode the geometric transformations of the body segments across frames. In order to compare their shapes independently of the execution rate, they are mapped to the shape space of continuous curves via the square root velocity representation [35]. Classification is then performed using the Hoeffding tree (VFDT) [36], based on the elastic metric in the shape space.
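To make the square root velocity idea concrete, the following Python sketch computes the commonly used form q(t) = β′(t) / √‖β′(t)‖ for a sampled curve β. The function name `srv`, the finite-difference derivative, and the uniform sampling step `dt` are assumptions made for illustration. The sketch covers only the representation itself; a rate-invariant comparison additionally requires an optimization over reparameterizations, which is not shown here.

```python
import numpy as np

def srv(beta, dt=1.0):
    """Square root velocity representation of a sampled curve.

    beta: array of shape (T, D), one D-dimensional point per time sample.
    Returns an array of shape (T, D) holding q(t) = beta'(t) / sqrt(||beta'(t)||).
    """
    vel = np.gradient(beta, dt, axis=0)                  # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1, keepdims=True)   # ||beta'(t)|| per sample
    # Where the speed is zero the velocity is zero too, so q(t) = 0 there.
    return vel / np.sqrt(np.maximum(speed, 1e-12))

# Example: a straight line traversed at constant velocity (3, 4, 0).
beta = np.outer(np.arange(5.0), np.array([3.0, 4.0, 0.0]))
q = srv(beta)
```

By construction, the squared norm of q(t) equals the speed of the curve at time t, which is the property that makes the L2 metric on q correspond to an elastic metric on the curves.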

### *3.2. Skeleton Motion Modeling*

First, we present the spatio-temporal modeling of the sequences. We describe the geometric relation between a body part (denoted by *part*) at frame *f*<sub>*t*</sub> and the same part at the successive frame *f*<sub>*t*+1</sub>. To do this, we use the rotation and translation required to move the current part to the position and orientation of the same part in the next frame, which we compute using the *procrustes* function. This rigid transformation (a rotation and a translation) between two rigid body parts is an element of the special Euclidean group *SE*(3) [34] and is defined by a 4 × 4 matrix of the form:

$$P(R, \vec{d}) = \begin{bmatrix} R & \vec{d} \\ 0 & 1 \end{bmatrix} \in SE(3) \tag{1}$$

where $\vec{d} \in \mathbb{R}^3$ and $R \in \mathbb{R}^{3 \times 3}$ is a rotation matrix, that is, a point on the special orthogonal group *SO*(3).
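As an illustration of Equation (1), the following Python sketch estimates a rotation and a translation between two sets of corresponding 3D points using the SVD-based Kabsch solution (orthogonal Procrustes without scaling) and assembles them into a 4 × 4 matrix. The function names `rigid_transform` and `to_se3` are hypothetical, and the sketch assumes that each body part provides at least three non-collinear points. It is not necessarily the procedure used to produce the results reported here.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, d such that dst is approximately R @ src + d.

    src, dst: arrays of shape (k, 3) with corresponding 3D points.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    s = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, s]) @ U.T
    d = dst_c - R @ src_c
    return R, d

def to_se3(R, d):
    """Assemble the 4 x 4 matrix P(R, d) of Equation (1)."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = d
    return P

# Example: recover a known rigid motion from synthetic points.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
d_true = np.array([0.5, -1.0, 2.0])
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 3))
dst = src @ R_true.T + d_true
R_est, d_est = rigid_transform(src, dst)
P = to_se3(R_est, d_est)
```

Note that a segment defined by only two joints does not determine the rotation about its own axis, so an implementation working directly on two-joint segments needs an additional convention to fix that degree of freedom.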

This geometric transformation of a rigid body part between two successive frames is represented by a point in *SE*(3). Accordingly, the transformations of all body parts are represented by a point in the Lie group *SE*(3) × ... × *SE*(3), where × denotes the direct product between Lie groups. The temporal evolution of the body parts can therefore be modeled as a trajectory in the Lie group *SE*(3) × ... × *SE*(3), as depicted in Figure 2.

**Figure 2.** Action as a curve in the Lie group.

The identity element of the Lie group *SE*(3) is the 4 × 4 identity matrix *I*<sub>4</sub>. The tangent space to *SE*(3) at the identity element is denoted by *se*(3) and is the Lie algebra of *SE*(3). This tangent space is six-dimensional and consists of matrices of the form:

$$B = \begin{bmatrix} U & \vec{w} \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -u_3 & u_2 & w_1 \\ u_3 & 0 & -u_1 & w_2 \\ -u_2 & u_1 & 0 & w_3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \in se(3) \tag{2}$$

where $\vec{w} \in \mathbb{R}^3$ and $U \in \mathbb{R}^{3 \times 3}$ is a skew-symmetric matrix. Thus, *B* can be represented as a six-dimensional vector:

$$\text{vec}(B) = [u_1, u_2, u_3, w_1, w_2, w_3] \tag{3}$$

The exponential map for *SE*(3) is defined as exp<sub>*SE*(3)</sub> : *se*(3) → *SE*(3), and the inverse exponential (logarithm) map as log<sub>*SE*(3)</sub> : *SE*(3) → *se*(3). They are used to move from the tangent space to the manifold and from the manifold to the tangent space, respectively, and are given by:

$$\exp_{SE(3)}(B) = e^{B}, \quad \log_{SE(3)}(P) = \log(P) \tag{4}$$

where *e* and *log* denote the matrix exponential and logarithm, respectively.
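Equations (2) to (4) can be checked numerically. The Python sketch below builds the matrix *B* from a six-dimensional vector, applies the matrix exponential and logarithm from SciPy (`scipy.linalg.expm` and `scipy.linalg.logm`), and recovers the original vector. The helper names `hat`, `vec`, `exp_se3`, and `log_se3` are chosen here for illustration, and the round trip assumes a rotation angle below π so that the principal matrix logarithm is the relevant one.

```python
import numpy as np
from scipy.linalg import expm, logm

def hat(v):
    """Map [u1, u2, u3, w1, w2, w3] to the 4 x 4 matrix B of Equation (2)."""
    u1, u2, u3, w1, w2, w3 = v
    return np.array([
        [0.0, -u3,  u2, w1],
        [ u3, 0.0, -u1, w2],
        [-u2,  u1, 0.0, w3],
        [0.0, 0.0, 0.0, 0.0],
    ])

def vec(B):
    """Inverse of hat: read [u1, u2, u3, w1, w2, w3] from B (Equation (3))."""
    return np.array([B[2, 1], B[0, 2], B[1, 0], B[0, 3], B[1, 3], B[2, 3]])

def exp_se3(B):
    """Exponential map se(3) -> SE(3), Equation (4)."""
    return expm(B)

def log_se3(P):
    """Logarithm map SE(3) -> se(3), Equation (4); keeps the real part only."""
    return np.real(logm(P))

# Round trip: vector -> se(3) -> SE(3) -> se(3) -> vector.
v = np.array([0.1, -0.2, 0.3, 1.0, 2.0, 3.0])
P = exp_se3(hat(v))
v_back = vec(log_se3(P))
```

The upper-left 3 × 3 block of `P` is a rotation matrix and its last row is [0, 0, 0, 1], in agreement with Equation (1).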

The geometric transformations of all body parts between two successive frames *f*<sub>*t*</sub> and *f*<sub>*t*+1</sub> can be represented as:

$$\delta(t) = \left(P_{f_t(1), f_{t+1}(1)}(t), P_{f_t(2), f_{t+1}(2)}(t), \dots, P_{f_t(M), f_{t+1}(M)}(t)\right) \in SE(3) \times \dots \times SE(3)$$

where *M* is the number of body parts. Using this representation, a skeletal sequence describes an action as a curve in *SE*(3) × ... × *SE*(3). According to [2], action curves cannot be classified directly in the curved space *SE*(3) × ... × *SE*(3). In addition, temporal modeling approaches are not directly applicable to this space. We therefore map the trajectory in *SE*(3) × ... × *SE*(3) to its Lie algebra *se*(3) × ... × *se*(3), the tangent space at the identity element *I*<sub>4</sub>. In this way, all trajectories are mapped from the Lie group to the same tangent space at the identity. We argue that the mapped curves are quite faithful to the original curves because they stay close to the identity element of the Lie group: they represent the transformations of the same body parts across successive frames. The resulting curve in the Lie algebra corresponding to *δ*(*t*) is given by:

$$\begin{aligned} \sigma(t) = \big( &\text{vec}(\log(P_{f_t(1), f_{t+1}(1)}(t))), \text{vec}(\log(P_{f_t(2), f_{t+1}(2)}(t))), \\ &\dots, \text{vec}(\log(P_{f_t(M), f_{t+1}(M)}(t))) \big) \in se(3) \times \dots \times se(3) \end{aligned} \tag{5}$$

The dimension of the feature vector *σ*(*t*) at any time *t* is 6*M*. Consequently, the temporal representation of an action sequence is a vector of dimension 6 × *M* × *N*, where *M* is the number of body parts (*M* = 19) and *N* is the number of frames in the sequence.
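The construction of *σ*(*t*) in Equation (5) and the resulting dimensions can be sketched in Python as follows. The function name `lie_algebra_curve`, the array layout `(T, M, 4, 4)`, and the synthetic near-identity transformations are assumptions made for illustration, with `T` denoting the number of time samples of the curve.

```python
import numpy as np
from scipy.linalg import expm, logm

def lie_algebra_curve(P_seq):
    """Map a curve in SE(3) x ... x SE(3) to se(3) x ... x se(3).

    P_seq: array of shape (T, M, 4, 4), one SE(3) matrix per time sample and part.
    Returns sigma of shape (T, 6 * M), following Equation (5).
    """
    T, M = P_seq.shape[:2]
    sigma = np.empty((T, 6 * M))
    for t in range(T):
        for m in range(M):
            B = np.real(logm(P_seq[t, m]))  # matrix logarithm, element of se(3)
            # vec(B) = [u1, u2, u3, w1, w2, w3], as in Equation (3)
            sigma[t, 6 * m:6 * m + 6] = [B[2, 1], B[0, 2], B[1, 0],
                                         B[0, 3], B[1, 3], B[2, 3]]
    return sigma

# Synthetic example: small random motions for M = 19 parts over T = 5 samples.
rng = np.random.default_rng(0)
T, M = 5, 19
twists = 0.05 * rng.normal(size=(T, M, 6))
P_seq = np.empty((T, M, 4, 4))
for t in range(T):
    for m in range(M):
        u1, u2, u3, w1, w2, w3 = twists[t, m]
        B = np.array([[0.0, -u3,  u2, w1],
                      [ u3, 0.0, -u1, w2],
                      [-u2,  u1, 0.0, w3],
                      [0.0, 0.0, 0.0, 0.0]])
        P_seq[t, m] = expm(B)

sigma = lie_algebra_curve(P_seq)   # one 6M-dimensional vector per time sample
feature = sigma.reshape(-1)        # flattened temporal representation
```

With *M* = 19, each row of `sigma` has 6 × 19 = 114 entries, and the flattened vector has 6 × *M* × `T` entries.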
