#### **4. Experimental Conditions and Results**

The experiments were performed on the Weizmann dataset [30], which consists of short video sequences lasting up to several seconds (144 × 180 px, 50 fps). Foreground binary masks extracted from the database were made available by its authors and are used here as input data. Masks are provided for 93 sequences; however, three of them are removed because one actor performed three of the actions twice, moving in two different directions. As a result, the database contains 10 action classes, each performed by nine different actors. During the experiments we follow the data processing steps presented in the previous section. Firstly, each frame is preprocessed individually, and then all sequences are divided into two subsets: actions performed in place and actions with a changing location of the silhouette. After preprocessing, the number of images in a sequence varies from 28 to 146. The group of actions performed in place consists of five action classes: 'bend', 'jump in place', 'jumping jack', 'wave one hand' and 'wave two hands', whereas the actions with a changing location of the silhouette are: 'jump forward on two legs', 'run', 'skip', 'walk' and 'gallop sideways'. Figure 4 depicts selected masks after the preprocessing step (for two different actions). The subsequent steps, including shape description, action representation and action classification, are performed separately in each subgroup. Ultimately, the classification accuracy values of the two subgroups are averaged, which gives the final effectiveness of the approach.

**Figure 4.** Exemplary preprocessed masks for 'walk' (**top row**) and 'jumping jack' (**bottom row**) actions.
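The split into the two subsets can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name `is_in_place` and the pixel threshold are assumptions, and only the horizontal displacement of the silhouette centroid between the first and last frame is checked.

```python
import numpy as np

def centroid(mask):
    """Centroid (x, y) of the foreground pixels in a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def is_in_place(masks, threshold=20.0):
    """Treat a sequence as 'performed in place' when the horizontal
    displacement of the silhouette centroid between the first and the
    last frame stays below a threshold (in pixels); otherwise the
    silhouette changes location. Threshold value is illustrative."""
    xs = [centroid(m)[0] for m in masks]
    return abs(xs[-1] - xs[0]) < threshold
```

With such a test, 'jumping jack' sequences fall into the in-place subset while 'walk' sequences fall into the changing-location subset.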

The main part of the experiments refers to the assumed application scenario, which is related to Ambient Assisted Living and the concept of active ageing. In this scenario, human activity analysis is performed to identify types of exercises. Physical activity is indicated as one of the methods of reducing the risk of developing non-communicable diseases. Based on social-cognitive behavioural theory, it is advisable to promote self-care and incorporate self-monitoring of progress. Nowadays, video content analysis techniques and the ubiquity of cameras (in laptops and smartphones) facilitate the implementation of exercise monitoring solutions. To carry out the experiment concerning the recognition of exercise types, we composed a database using selected classes from the Weizmann database. Action classes were compared with the recommended exercises presented in [15]. In addition, it was taken into account that the exercises are supposed to be performed in a home environment; for this reason, the 'run' class is excluded. Moreover, since there are two classes with a waving action, the 'wave one hand' class is excluded as well. The remaining action classes may be associated with the following exercises (based on [15]):


Several experiments were carried out to investigate the best combination of methods and parameters for the approach. Twenty simple shape descriptors were tested in combination with three matching measures and up to 256 Fast Fourier Transform coefficients. To focus on the highest results and present them concisely, for each matching measure an experiment consisting of several tests is performed, in which a selected simple shape descriptor and a varying number of coefficients are used. The results of exercise recognition are provided in Table 1. The best result is defined as the highest accuracy obtained with the smallest action representation. The highest accuracy for actions performed in place is 100%, achieved when MBR width is used and the action representation contains 54 elements. This means that each action sequence, regardless of the number of frames, is represented using 54 Fast Fourier Transform coefficients, and a representation has the form of a vector with 54 real values. The matching process can then be performed using either Euclidean distance or C1 correlation. An accuracy of 100% is also obtained for X/Y Feret, but more DFT coefficients are required. Actions with a changing location of the silhouette are most successfully classified if MBR area is used and the action representation contains 32 values, yielding an accuracy of 94.44%. Again, either Euclidean distance or C1 correlation can be employed. Ultimately, the averaged correct classification rate for exercise type recognition is 97.2% (8 action classes).
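The construction of a fixed-length action representation can be sketched as below. This is an illustrative reading of the pipeline, not a verified reimplementation: an axis-aligned bounding rectangle stands in for the paper's MBR, the spectrum magnitudes are kept (the paper's exact coefficient handling and normalization are not specified here), and zero-padding for short sequences is an assumption.

```python
import numpy as np

def mbr_width(mask):
    """Width of the bounding rectangle of the silhouette
    (axis-aligned here; the paper's MBR variant may differ)."""
    xs = np.nonzero(mask)[1]
    return int(xs.max() - xs.min() + 1)

def action_representation(masks, n_coeffs=54):
    """Describe each frame by one simple shape descriptor, then keep
    the magnitudes of the first n_coeffs DFT coefficients, so every
    sequence maps to a vector of n_coeffs values regardless of its
    frame count. Sequences shorter than n_coeffs are zero-padded."""
    series = np.array([mbr_width(m) for m in masks], dtype=float)
    spectrum = np.abs(np.fft.fft(series, n=max(n_coeffs, len(series))))
    return spectrum[:n_coeffs]

def euclidean(a, b):
    """One of the tested matching measures."""
    return float(np.linalg.norm(a - b))
```

Two sequences of different lengths thus become directly comparable vectors of the same size.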

**Table 1.** Experimental results for the recognition of exercise types using 8 classes of the Weizmann dataset. The results are presented separately for actions performed in place and actions with changing location of a silhouette. The highest accuracy values are listed with the indication of the applied simple shape descriptor and the size of an action representation (given in brackets).


A second set of experiments concerned the use of the Weizmann database as a benchmark and the comparison of the results for 10 classes with the previous version of our approach, presented in [29]. The results are presented in Table 2. For actions performed in place the highest accuracy is 86.67% (MBR perimeter, 52 DFT coefficients, Euclidean distance), and for actions with a changing location of the silhouette the accuracy equals 95.56% (MBR area, 33 DFT coefficients, C1 correlation). Assuming that different methods and parameters may be used in each subgroup, the averaged accuracy for the entire database is 91.12%. This outperforms our previous approach based on simple shape descriptors, which yielded 83.3% accuracy for actions performed in place (MBR width) and 85.4% (PAM eccentricity) for the other subgroup. The averaged accuracy then equalled 84.35%, which means that the current approach improves the accuracy by nearly 7 percentage points.

**Table 2.** Experimental results for the Weizmann dataset used as a benchmark, presented separately for actions performed in place and actions with changing location of a silhouette. The highest accuracy values are listed with the indication of the applied simple shape descriptor and the size of an action representation (given in brackets).


#### **5. Discussion and Conclusions**

Section 2 of the paper describes related works. The methods presented there, that is [8–10,23–28], were chosen for two main reasons: they concern action recognition and they use the Weizmann database. However, the approaches used to represent a frame or a silhouette are diverse. In [23], a set of rectangular patches is used to represent a shape, and in [26] only contour points are applied. Some researchers use various transforms; e.g., the authors of [24] apply the Trace transform to binarized silhouettes, while in [28] the two-dimensional Fourier transform is applied to the original frames. An opposite approach is the use of cumulative silhouettes [8,9] or cumulative skeletons [10]. Other techniques include dense trajectories based on salient points [27] and time series [25].

The approach proposed in this paper combines simple shape descriptors with the one-dimensional Fourier transform and a standard leave-one-out classification procedure. Each action sequence is firstly described by a set of simple features and represented using a predefined number of Fourier coefficients. Classification is two-stage: firstly, actions are divided into two subgroups based on trajectory length, and secondly, leave-one-sequence-out cross-validation is performed. The proposed approach yields 97.2% accuracy in the assumed application scenario and 91.12% accuracy on the entire Weizmann database. The best results were obtained with features based on a minimum bounding rectangle: its area, width and perimeter. These features are simple; however, when observed over time, they carry much more information about an action. Therefore, the input data can be limited to rectangular objects of interest, representing regions where silhouettes are located. These areas can be tracked over time to extract centroid locations. Under these assumptions, the calculation of convex hulls may be omitted.
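The leave-one-sequence-out stage can be sketched as a nearest-neighbour evaluation over the fixed-length representations. This is a minimal sketch under assumed names (`leave_one_out_accuracy`) and an assumed 1-NN decision rule with Euclidean distance; the paper also tests other matching measures such as C1 correlation.

```python
import numpy as np

def leave_one_out_accuracy(reps, labels):
    """Leave-one-sequence-out: each action representation is matched
    against all the others and receives the class label of its nearest
    neighbour; accuracy is the fraction of correct assignments."""
    reps = np.asarray(reps, dtype=float)
    correct = 0
    for i in range(len(reps)):
        dists = np.linalg.norm(reps - reps[i], axis=1)
        dists[i] = np.inf  # exclude the held-out sequence itself
        nearest = int(np.argmin(dists))
        correct += labels[nearest] == labels[i]
    return correct / len(reps)
```

Running this separately in each subgroup and averaging the two accuracies mirrors the evaluation scheme described above.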

A comparison of the recognition rates of the proposed approach with other methods tested on the Weizmann dataset is presented in Table 3. Although our approach does not achieve perfect accuracy, it is competitive with several other solutions. It should be noted that the presented methods may assume other application scenarios and experimental conditions. Moreover, limiting the number of classes does not always improve the results, as our experiments show: when the classification of actions with a changing location of the silhouette is performed for 10 classes, the highest accuracy is 95.56%, while for the limited number of classes it decreases to 94.44%. With this in mind, we refer in particular to the results presented in [8,23,28], which outperformed our results in the experiment concerning the assumed application scenario. The authors of [28] also employ spectral domain features; however, these features are extracted from video frames using the two-dimensional Fourier transform. In our approach the frames are represented using simple shape descriptors, which for each sequence are concatenated into a vector to which the one-dimensional Fourier transform is applied. Therefore, the initial data dimensionality is lower. The descriptor proposed in [23] requires the extraction of rectangular regions from a human silhouette, which may be problematic in the case of imperfect silhouettes. The approach proposed in [8] uses an accumulated silhouette representation, which requires all foreground masks from an action sequence. In our approach each foreground mask is represented separately; therefore, in a real-time scenario the proposed approach can be adjusted to utilise fewer frames.

The proposed approach has several advantages: it can be adapted to different action types by selecting other shape features and matching measures, and action representations are small and easy to calculate because simple algorithms are applied. Moreover, if another distinctive feature were found, instead of the centroid or in addition to it, the recognition space could be narrowed more efficiently, eliminating some misclassifications. The use of different methods in each subgroup improves the overall results. The presented version of the approach is promising; however, further improvement is needed. Our future work includes experiments on other databases with a larger number of classes corresponding to different exercises. Moreover, recently popular solutions based on deep learning will also be tested.


**Table 3.** Comparison of recognition rates obtained on the Weizmann database (the cited methods are explained in Section 2, Related Works).

**Author Contributions:** Conceptualization, K.G. and D.F.; methodology, K.G.; software, K.G.; validation, K.G. and D.F.; investigation, K.G.; writing—original draft preparation, K.G.; writing—review and editing, K.G. and D.F.; visualization, K.G.; supervision, D.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.
