**1. Introduction**

Human Activity Recognition (HAR) based on the video content analysis approaches is gaining more and more interest thanks to the wide variety of possible applications, such as video surveillance, human-computer interfaces or monitoring of patients and elderly people in their living environments. An exemplary structure of the HAR system may consist of the following general modules: motion segmentation, object classification, human tracking, action recognition and semantic description [1]. If a focus is put to action recognition (exercise classification), it can be assumed that input data type, localised objects and their positions are known. Based on the taxonomy presented in [2] the techniques for action recognition are divided into holistic and local representations. Holistic solutions use global representations of human shape and movement, accumulating several features. The most popular solutions include Motion History Image and Motion Energy Image templates proposed by Bobick and Davis [3] or Space-Time Volume representation introduced by Yilmaz and Shah [4]. Local representations usually are based on interest points which are used to extract a set of local descriptors, e.g., Space-Time Interest Points by Laptev [5]. Instead of aggregating features from all frames, some researchers propose to extract only several foreground silhouettes, so called key poses (e.g., [6,7]). If binary silhouettes are used as input data, various shape features can be extracted and combined, such as shape and contour [8], orientation [9] or skeleton [10]. Apart from traditional approaches, more challenging tasks can benefit from the application of deep learning

techniques, such as Convolutional Neural Networks [11]. Ultimately, the choice of methods is dependant, among others, on the application scenario and data complexity.

A task of exercise classification can be related to the Ambient Assisted Living (AAL), which refers to concepts, products and services introducing new technologies for people in all phases of life, allowing them to stay healthy, independent, safe and well-functioning at their living environment. In the era of an ageing society and a significant proportion of older people living alone or unattended, expanding the range of care support options is becoming more and more important. Another major focus of AAL is prolonging the time people can live on their own, being in good health and in good physical shape. This is related to the increasing life expectancy and successful ageing. The World Health Organisation policy in Active Ageing applies to physical, mental and social well-being, and is defined in [12] as "the process of optimizing opportunities for health, participation and security in order to enhance quality of life as people age". Among people aged 45 and over, non-communicable diseases (NCDs) are the most frequent causes of mortality and disability all over the world. The risk of NCDs morbidity is higher in this age group; however, risk factors may originate in younger years. NCDs include, among others, cardiovascular diseases, hypertension, diabetes, chronic obstructive pulmonary disease, musculoskeletal conditions and mental health conditions. One of the risk factors is a sedentary lifestyle [12]. Low level of physical activity and lack of exercises can directly lead to obesity which increases the risk of NCDs as well. The study presented in [13] shows that greater physical fitness is associated with reduced risk of developing many NCDs. The authors of [14] advise promoting positive health behaviour rather than reducing negative ones, such as above-mentioned sedentary lifestyle. It is recommended to focus on the benefits of physical activity, provide motivation and promote self-care. Models based on social-cognitive behavioural theory are indicated as self-regulatory strategies that can contribute to increasing physical activity based on skills such as goal setting and self-monitoring of progress.

This paper follows the idea of active ageing and the use of activity monitoring solutions. However, the approach which is here proposed aims only at recognizing primitive actions that resemble some recommended exercise types [15], such as resistance, aerobic, stretching, balance and flexibility exercises. A specific scenario is assumed in which a person wants to do a workout in front of a laptop where a video with exercises is displayed. The laptop camera captures people's activities and the algorithm analyses them. The classification is performed in order to determine the amount, frequency and duration of a specific exercise. This, in some way, may encourage a person to engage in more physical activity. Due to presented reasoning, Section 2. Related works is focused on the methods and techniques used in action recognition approaches based on video content analysis. In our approach, we use foreground masks extracted from video sequences, each representing single person performing an action. Foreground masks carry information about an object's pose, shape and localisation. Therefore, various features can be retrieved and combined in order to create an action representation—here it is proposed to combine trajectory, simple shape descriptors and Discrete Fourier Transform (DFT).

The rest of the paper is organised as follows: Section 2 presents selected related works, that concern the recognition and classification of similar actions. Section 3 explains consecutive steps of the proposed approach together with applied methods and algorithms. Section 4 describes experimental conditions and presents the results. Section 5 discusses the results and concludes the paper.

#### **2. Related Works**

An action can be defined as a single person short activity composed of multiple gestures organised in time that lasts up to several seconds or minutes [1,16], e.g., running, walking or bending. Many actions can be performed for a longer time than several minutes, however due to their periodic characteristic only a short action span is used for recognition. The recognition process is here understood as assigning action labels to sequences of images [17]. Then, action classification can be based on various features, such as colour, grey levels, texture, shape or characteristic points like

centroid or contour. Selected features are numerically represented using specific description algorithms in a form of so-called representation or descriptor. According to [2], good representation for action recognition has to be easy to calculate, provide a description for as many classes as possible and reflect similarities between look-alike actions. There is a large body of literature on video-based action recognition and related topics investigating wide variety of methods and algorithms using diverse features. An interest is reflected in the still emerging surveys and reviews (e.g., [2,18–22]). Due to many techniques on action classification reported in the literature, here we refer to several works that correspond to our interests in terms of the methods and data used.

The authors of [23] propose a novel pose descriptor based on the human silhouette, called Histogram of Oriented Rectangles. The human silhouette is represented by a set of oriented rectangular patches, and the distribution of these patches is represented as oriented histograms. Histograms are classified by different techniques, such as nearest neighbour classification, Support Vector Machine (SVM) and Dynamic Time Warping (DTW), among which the last one turned out to be the most accurate. Another silhouette based feature was proposed in [24] which uses Trace transform for a set of silhouettes representing single period of action. The authors introduce two feature extraction methods: History Trace Templates (a sequence representation with spatio-temporal information) and History Triple Features (a set of invariant features calculated for every frame). The classification is performed using Radial Basis Function Kernel SVM and Linear Discriminant Analysis is applied for dimensionality reduction. Action recognition based on silhouettes is presented in [25] as well. All silhouettes in a sequence are represented as time series (using a rotation invariant distance function) and each of them is transformed into so-called Symbolic Aggregate approXimation (SAX). An action is then represented by a set of SAX vectors. The model is trained using the random forest method and various classification methods are tested. The authors of [26] propose a novel feature for action recognition based on silhouette contours only. A contour is divided into radial bins of the same angle using centroid coordinates and a summary value is obtained for each bin. A summary value (variance, max value or range) depends on Euclidean distances from centroid to contour points in every radial bin. The proposed feature is used together with a bag of key poses approach and tested in single- and multi-view scenarios using DTW and leave-one-out procedure.

The authors of [8,9] use accumulated silhouettes (all binary masks of an action sequence are compressed into one image) instead of every silhouette separately. In [8] various contour- and region-based features are combined, such as Cartesian Coordinate Features and Histogram of Oriented Gradients (HOG). SVM and K-nearest neighbour (KNN) classifiers are used (the latter one in two scenarios). In total, seven different features and three classifiers are experimentally tested. The highest accuracy is reported for a combination of HOG and KNN in leave-one-sequence-out scenario. In [9] an average energy silhouette image is calculated for each sequence. Then region of interest is detected and several features are calculated: edge distribution of gradients, directional pixels and rotational information. These feature vectors are combined in action representations which are then classified using SVM classifier. In [10], instead of accumulated silhouettes, the authors propose the cumulative skeletonised image—all foreground objects' skeletons of each action sequence are aligned to the centroid and accumulated into one image. Action features are extracted from these cumulative skeletonised images. In an off-line phase the most discriminant human body regions are selected and classified in an online phase using SVM. The authors of [27] propose a motion descriptor which describes patterns of neighbouring trajectories. Two-level occurrence analysis is performed to discover motion patterns of trajectory points. Actions are classified using SVM with different kernels or random forest algorithm. The approach proposed in [28] employs spectral domain features for action classification, however silhouette features are not involved. Instead, the two-dimensional Discrete Fourier Transform is applied to each video frame and a part of high amplitude coefficients is taken. For a given sequence, selected coefficients of all frames are concatenated into action representation. Larger representations can be reduced using Principal Component Analysis. Action classification is performed using SVM or a simple classifier based on Euclidean distance.
