**1. Introduction**

Human activity recognition has attracted many research groups in recent years due to its wide range of promising applications in different domains, such as surveillance, video games, and physical rehabilitation. In order to develop systems for understanding human behavior, visual data constitute one of the most important cues compared to verbal or vocal communication data. Moreover, the introduction of low-cost depth cameras with real-time capabilities, like the Microsoft Kinect, which provide a depth image in addition to the classical red-green-blue (RGB) image, makes it possible to estimate a 3D humanoid skeleton in real time thanks to the work of Shotton et al. [1]. This type of data brings several advantages: it makes the background easy to remove and allows extracting and tracking the human body, thus capturing the human motion in each frame. Additionally, the 3D depth data are independent of the human appearance (texture), providing a more complete human silhouette than the silhouette information used in the past. Thus, new datasets with RGB-depth (RGBD) data have been collected, and many efforts have been devoted to human action recognition.

However, human activity understanding is a more challenging problem due to the diversity and complexity of human behaviors, and it has received less attention in previous approaches. The interaction with objects creates an additional challenge for human activity recognition. Indeed, in a human–object interaction scene, the hands may hold objects that are hard to detect or recognize due to heavy occlusions and appearance variations. High-level information about the objects is therefore needed to recognize the human–object interaction.

Looking at past skeleton-based human activity recognition approaches, we can distinguish two categories: the first family of approaches considers the skeleton data as body parts, and the second family considers them as a set of joints, as categorized by [2]. The scope of this paper is related to the first family of approaches, which consider the human skeleton as a connected set of rigid segments and either model the temporal evolution of individual body parts [3] or focus on connected pairs of body parts and model the temporal evolution of joint angles [4,5]. More recently, Vemulapalli et al. [2] proposed to model a skeleton by all the possible rigid transformations between its segments. In other words, for each skeleton, hundreds of rotations and translations (say *L* of them) are computed between all the skeleton segments, yielding *L* points on the special Euclidean group *SE*(3). The evolution of the skeleton along frames generates a trajectory in *SE*(3)*<sup>L</sup>*. The trajectories are then mapped to the Lie algebra (the tangent space at the identity element of the special Euclidean group). The main limitation of this approach is the distortion caused by this mapping, especially for points far from the identity element. In [6], the authors improved this method with a rolling-based approach that minimizes the distortions in the tangent space (Lie algebra).

In this paper, we propose to investigate the transformations of each skeleton part across frames in the Lie group, rather than between segments within the same frame as in [2,6]. Compared to [2] and [6], the proposed model presents three main advantages:


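To make the above representation concrete, the following is a minimal sketch of how a single relative rigid transformation between two body segments can be computed as an element of *SE*(3) and mapped to the Lie algebra *se*(3). The function names and the particular alignment construction are illustrative assumptions, not the exact parameterization used in [2] or in this paper; stacking such log-mapped transformations per frame and concatenating them over time yields the kind of trajectory features discussed above.

```python
import numpy as np
from scipy.linalg import logm

def rigid_transform_between_segments(seg_a, seg_b):
    """Illustrative only: a rigid transformation (element of SE(3)) taking
    body segment seg_a onto segment seg_b. Each segment is a pair of 3D
    joint positions (start, end)."""
    a0, a1 = np.asarray(seg_a, dtype=float)
    b0, b1 = np.asarray(seg_b, dtype=float)
    da = (a1 - a0) / np.linalg.norm(a1 - a0)   # unit direction of segment a
    db = (b1 - b0) / np.linalg.norm(b1 - b0)   # unit direction of segment b
    # Minimal rotation aligning da with db (Rodrigues' formula);
    # degenerate when the segments are exactly anti-parallel (c == -1).
    v, c = np.cross(da, db), np.dot(da, db)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    R = np.eye(3) + vx + vx @ vx / (1.0 + c)
    t = b0 - R @ a0                            # translation matching the start joints
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T                                   # homogeneous 4x4 element of SE(3)

def log_map_se3(T):
    """Map an SE(3) element to the Lie algebra se(3), i.e. the tangent
    space at the identity, via the matrix logarithm."""
    return np.real(logm(T))

# Example usage on two toy segments (e.g., upper arm and forearm in one frame)
upper_arm = [[0.0, 0.0, 0.0], [0.0, 0.3, 0.0]]
forearm = [[0.0, 0.3, 0.0], [0.2, 0.5, 0.1]]
T = rigid_transform_between_segments(upper_arm, forearm)
xi = log_map_se3(T)                            # 4x4 matrix in se(3)
print(xi)
```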
The main contributions of this work are:


The paper is organized as follows: we provide a brief review of the existing literature in Section 2 and discuss the spatio-temporal modeling in Section 3. Section 4 presents the rate-invariance modeling and classification. We present our experimental results in Section 5 and conclude the paper in Section 6.
