*2.3. RGB-D-Based Developments*

The depth image is robust to lighting changes, but it loses useful information such as texture, which is essential for distinguishing activities that involve human-object interactions. Several recent works have shown that recognizing such activities benefits from merging RGB sequences with depth images [26–33]. For example, Zhao et al. [33] combined interest-point descriptors extracted from RGB and depth sequences to perform recognition. Liu and Shao [26] used a deep architecture to jointly fuse RGB and depth information; in [19], an ensemble of random forests was used to merge spatio-temporal features with key human joints; Shahroudy et al. [27] used a structured density method to merge RGB information with skeleton features; and in [29], the authors simply concatenated skeleton-based and silhouette-based features to perform classification.
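The simplest of these fusion strategies, the feature concatenation used in [29], can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random vectors stand in for the skeleton-based and silhouette-based descriptors, and the per-modality L2 normalization is a common (assumed) preprocessing choice to keep one modality from dominating the fused vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real per-frame descriptors: in [29] these would be
# skeleton-based and silhouette-based features; here they are synthetic.
def skeleton_features(n_samples, dim=64):
    return rng.normal(size=(n_samples, dim))

def silhouette_features(n_samples, dim=32):
    return rng.normal(size=(n_samples, dim))

def fuse_by_concatenation(f_skel, f_silh):
    """Early fusion: L2-normalize each modality, then concatenate
    along the feature axis. The fused vectors are then fed to any
    standard classifier (e.g., an SVM)."""
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return np.concatenate([l2norm(f_skel), l2norm(f_silh)], axis=1)

n = 10
fused = fuse_by_concatenation(skeleton_features(n), silhouette_features(n))
print(fused.shape)  # (10, 96): 64-dim + 32-dim per sample
```

The appeal of this scheme is its simplicity: no joint model of the two modalities is learned, so any off-the-shelf classifier can consume the fused vectors, at the cost of ignoring cross-modal correlations that the deeper fusion architectures above try to exploit.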

A review and analysis of current RGB-D action datasets reveals several limitations, including their size, applicability, availability of ground-truth labels, and evaluation protocols. There is also the problem of dataset saturation, a phenomenon whereby the reported algorithms achieve near-perfect performance.
