Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor

Esmaeeli, Raana; Valadan Zoej, Mohammad Javad; Safdarinezhad, Alireza; Ghaderpour, Ebrahim

doi:10.3390/s24206713

Open AccessArticle

Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor

by

Raana Esmaeeli

¹,

Mohammad Javad Valadan Zoej

¹

,

Alireza Safdarinezhad

^2,*

and

Ebrahim Ghaderpour

^3,*

¹

Department of Photogrammetry and Remote Sensing, Faculty of Surveying Engineering, K. N. Toosi University of Technology, Tehran 19967-15443, Iran

²

Department of Geodesy and Surveying Engineering, Tafresh University, Tafresh 39518-79611, Iran

³

Department of Earth Sciences, Sapienza University of Rome, Ple Aldo Moro, 5, 00185 Rome, Italy

^*

Authors to whom correspondence should be addressed.

Sensors 2024, 24(20), 6713; https://doi.org/10.3390/s24206713

Submission received: 28 September 2024 / Revised: 15 October 2024 / Accepted: 18 October 2024 / Published: 18 October 2024

(This article belongs to the Section Sensor Networks)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Human activity recognition is known as the backbone of the development of interactive systems, such as computer games. This process is usually performed by either vision-based or depth sensors. So far, various solutions have been developed for this purpose; however, all the challenges of this process have not been completely resolved. In this paper, a solution based on pattern recognition has been developed for labeling and scoring physical exercises performed in front of the Kinect sensor. Extracting the features from human skeletal joints and then generating relative descriptors among them is the first step of our method. This has led to quantification of the meaningful relationships between different parts of the skeletal joints during exercise performance. In this method, the discriminating descriptors of each exercise motion are used to identify the adaptive kernels of the Constrained Energy Minimization method as a target detector operator. The results indicated an accuracy of 95.9% in the labeling process of physical exercise motions. Scoring the exercise motions was the second step after the labeling process, in which a geometric method was used to interpolate numerical quantities extracted from descriptor vectors to transform into semantic scores. The results demonstrated the scoring process coincided with the scores derived by the sports coach by a 99.5 grade in the R² index.

Keywords:

human activity recognition; depth sensor; Kinect; mutual information; constrained energy minimization; scoring; physical exercise

1. Introduction

Computer games and entertainment have evolved in today’s world. The control of games is done through physical interaction instead of joysticks [1,2]. Virtual reality glasses and 3D sensors, such as Kinect, are examples of hardware used for controlling interactive games [3,4]. Alongside advancements in the gaming and entertainment industry, 3D sensors and virtual reality glasses are also being utilized as tools for smart classrooms, the metaverse, virtual twins, medical science, automated guided vehicles, and 3D modeling [5,6,7,8,9,10,11,12,13,14,15,16].

Kinect is a sensor that captures red–green–blue (RGB) and depth videos simultaneously. The depth data are generated using Time of Flight (TOF) detectors or by matching projected NIR patterns and photogrammetric triangulations, depending on the version [17,18]. Accordingly, the digital depth data numbers are either the quantized values of TOF measurements or disparities. These values have an inverse relation with distances between objects and the sensor [17]. The depth videos have been a valuable data source for various applications. Examples are fusing depth images to complete 3D models [19], positioning virtual objects within captured scenes [20], interior layout design [21], virtual clothes put on [22], the creation of interactive classrooms [4], and motion recognition [1].

Real-time control of computer games or automatic evaluation of dance or physical exercises via depth videos are applications that require the initial identification and extraction of the human body and its skeletal structure [4,23,24]. Skeletal structure extraction involves identifying at least 20 3D joint points of the body, as shown in Figure 1, from each depth video frame. Tracking these points can help distinguish human actions and movements.

Several methods have been developed to recognize human body parts and extract their skeletal structures from either RGB or depth images. While the methods using single RGB images can usually achieve 2D body parts estimation without or with weak depth information, which limits the performance of human action analysis, depth image-based methods could overcome more of these deficiencies. For example, Sideridou et al. [25] used RGB videos captured by smartphones and the BlazePose body pose estimation model for assessing exercises. Due to the uncertainties of the depth component of the extracted body parts, they were forced to change the viewpoint of the video camera to record the exercises that were not reliable when captured from other viewpoints [25]. Hence, human 3D skeleton detection from depth videos will probably be a better choice for human action recognition purposes. Shoton et al. [26] detected human body parts in depth images using a trained Random Forest segmentation, dividing them into 31 contiguous pieces, some of which were merged before identifying the skeletal points. In this method, the body’s joint locations are identified by calculating the centroid of each segment using the moving average method with the Gaussian kernel. In general, the accuracy and quality of methods for identifying skeletal points depend on the level of noise in the depth images and the level of body occlusions. Yeung et al. [27] attempted to solve the challenges of occlusion in extracting 3D body skeletal structures from single-view depth images by simultaneously using two Kinect sensors with orthogonal viewing angles. They first identified the occluded body joints in the initial depth image and then replaced them with the corresponding joints obtained from the second view while considering certain connectivity constraints. Irrespective of the goal to complete occluded body parts using bi-view depth videos, this approach can also improve the accuracy of detecting visible joints in both videos. Relative calibration and temporal synchronization between depth sensors are prerequisites for the success of this method. Cai et al. [28] also proposed a different method for extracting skeletal joint points. After applying a median filter to reduce the noise in the depth images, they extracted edges to separate the human border from its background. In this method, morphological filters were used in depth images to reveal the initial skeletal structure of a human. Then, the positions of the joints were extracted by using knowledge about the relative size of the body’s skeletal components and fitting the triangle with the maximum area to the curves achieved in the previous step. This method also faces challenges, like the method by Shoton et al. [26], when dealing with occlusions of the skeletal structure. Kuang et al. [29] utilized a two-level threshold approach to separate the overlapping parts of the human body in the depth image. Subsequently, they merged the skeletal points extracted from each threshold stage to form a complete skeletal structure. Li et al. [30] developed a post-processing method to refine the skeletal structure that was extracted using the method by Shoton et al. [26]. They utilized knowledge about the relative distances between the body’s joints to prevent noise in the skeletal structure. In this paper, we utilized the method proposed by Shoton et al. [26] to determine the three-dimensional position of the body’s joints, available in Microsoft Kinect SDK (https://www.microsoft.com/en-us/download/details.aspx?id=44561, last accessed on 26 September 2024).

On the other hand, some studies have concentrated on recognizing motion or actions by analyzing 3D skeletal sequences [1,6,7,8,10,12,13,24,31,32,33,34,35,36,37]. Some works focused on using contact or wearable devices to measure the raw data of human actions [38]. On the other hand, remote measurements of 3D human skeletal sequences are an alternative data source, and some developed methods using such data sources are reviewed in more detail hereafter.

Wang et al. [31] proposed a method using data mining and machine learning to identify actions in 3D skeletal sequences obtained from depth images. In this approach, Local Occupancy Pattern (LOP) and temporal features extracted from the Fourier temporal pyramid are utilized to differentiate and categorize actions. Zhao et al. [32] refined the sequence of 3D skeletal points based on viewing angle and anthropometry. They then measured similarities with the training data using the Dynamic Time Warping (DTW) method. In this method, it is possible to tag different gestures in real time. However, it may face challenges when processing long sequences. Wang and Wu [33] also used the Learning Maximum Margin Temporal Warping method to detect motion in depth videos. This approach was developed to address the temporal asynchrony of actions and inconsistencies in sequence duration when comparing depth videos. Kerola et al. [34] identified 3D key points in both individual skeleton frames and integrated 3D skeletal points over time. They then utilized the Spectral Graph Wavelet Transform (SGWT) method to create descriptors for classifying human activities in depth videos. In this study, the classifiers used were a combination of PCA and SVM. Chiang et al. [35] utilized Kinect sensors to monitor and provide feedback to patients performing intervention or rehabilitation exercises. They simultaneously captured the 3D skeletal nodes using both a Kinect sensor and a motion capture (MOCAP) system during the training stage. They then trained a Gaussian process regression model to increase the reliability of their measurements. In this method, live-captured Kinect sequences were divided into subsequences, and each subsequence was compared with its corresponding reference subsequence. Pang and Liang [1] employed relative features among skeletal joint points to extract temporal features. They utilized a modified version of the DTW algorithm to compare feature vectors to measure their similarity. Salim et al. [36] also used Support Vector Machine (SVM) and Random Forest techniques to classify 14 human activities by extracting 46 features from human body joints captured with the Kinect sensor. Pham et al. [24] also designed two neural networks entitled Attention-enhanced Adaptive Graph Convolutional Network with Feature Fusion (FF-AAGCN) and Double-feature Double-motion Network (DD-Net) for labeling nine different physical exercise motions. The results and data used in this research have also been used in the evaluation of the proposed method in this article. Li et al. [36] proposed a method for the recalibration of Memory Attention Networks (MANs) and deployed the Temporal Attention Recalibration Module (TARM) and Spatio-Temporal Convolution Module (STCM) for skeleton-based action recognition. They showed that MANs can improve the performance of action recognition. However, their need for huge data in the training process is a common limitation of the deep learning approaches. Irrespective of the wearable devices used in [38] for action recognition, Haghi et al. used three neural networks for human activity recognition in approximately real time. Whereas the high speed in response to the trained network can be achieved, the necessity of numerous training data, overfitting problems, and the time of training are the main limitations of artificial intelligence (AI) methods. Sideridou et al. [25] also developed a smartphone application based on kinematic machine learning to recognize the performance quality of exercise movements and guide people to perform them correctly by writing some messages on the application screen. Extracting temporal and frequency domain features from the time series of some angles in the human skeletal structure has been their data source for separating and evaluating exercises. They also did not try to assess the performance quality of the exercises with the scores given by a coach.

Generally, the main sources of data in human motion recognition are the direct usage of the time series of the skeletal nodes’ motions or extracting features from them [25,32]. The phase differences or differences in execution speeds among the recorded exercise movements are the challenges when comparing the time series. Synchronization and phase removal are the necessary preprocessing steps before comparing time series, which has been done in different ways so far [25,32]. The use of mutual information descriptors has not been utilized to mitigate these issues when comparing the time series of skeletal nodes. On the other hand, the development of an approach for scoring the physical exercises’ performance compatible with the coach’s decision was not addressed in previous research. In this paper, we proposed a method to discriminate physical exercises and score them in terms of correctness in performing through the sequences of the human skeletal structures acquired by the Kinect depth sensor. In this method, the significant descriptors that represent distinct relationships between temporal features extracted from the skeletal structure for each physical exercise are used as bases in labeling and scoring. The advantage of this method is the independence of labeling to the temporal contraction or expansion in performing physical exercises. In the proposed method, we first extract several temporal features from the 3D skeletal joint points and then generate descriptive vectors based on the mutual information (MI) among the extracted temporal features. Each physical exercise is examined by its responses to different adaptive Constrained Energy Minimization (CEM) kernels that are specifically generated for all classes to find its label. The adaptive CEM kernels are generated by distinct components of MI descriptors for each class. The score for each labeled exercise is calculated using a simple geometrical metric that compares cross-correlation descriptors of labeled exercises among the selected training samples.

2. Materials and Methods

Action recognition and scoring are necessary in various applications such as interactive classrooms, medical rehabilitation, and controlling computer games [1,2,3,6,7]. Another application of action recognition is interactive dance or physical exercises taught through game consoles and computers, which is the subject of this research. The proposed method in this paper tries to first recognize the type and then the score of physical exercises performed in front of the Kinect sensor by comparing them to the correct samples. The flowchart in Figure 2 shows the detailed procedure of the proposed method. In this flowchart, dashed arrows indicate the process of using the training data, while solid line arrows illustrate the process of using the proposed method to recognize and score unlabeled exercises.

The proposed method consists of eight steps. The first step involves extracting 3D skeleton points in each frame of the depth videos. Since we used the method described by Shotton et al. [26] for extracting 3D skeleton joint points in depth frames, we opt not to elaborate on this method and refer the readers to the cited source for further information. In the second step of the proposed method, 36 instantaneous features are extracted from skeletal points in each frame, including 26 angular features and 10 features based on Normalized Difference Distance Indices (NDDIs). In the third step, the pairwise relative relationship among the extracted features in specific time intervals is quantified through MI and Pearson’s Correlation Coefficient (CC) indices to form two 630-component descriptor vectors. The MI descriptor vectors are utilized in the labeling process, while the CC descriptors are employed in the scoring process. The CC is utilized in many applications to study the linear dependency between two variables, e.g., see [39,40] for its definition and interpretation of its value ranges. Therefore, in the fourth step, the most distinctive components of the MI descriptor vectors are selected for discriminating each exercise from the others. In the fifth step, the adaptive target detector kernels are independently estimated for all classes through the CEM method. Then, in the sixth step, the kernels’ response is calculated for an unlabeled sequence of 3D skeleton motion to find the appropriate label. The labels of unknown sequences are determined based on both the winning response of the adaptive CEM kernels and the analysis of their CC descriptors. The entire 3D skeletal video is smoothly classified by repeating the fifth and sixth steps for all the sliding windows of a sequence. In the final step of the proposed method, the performance of the winning exercise is evaluated by comparing it with the ideal and incorrect performances using geometric measures calculated from the CC descriptors. Each step of this method is described in detail in the following subsections.

2.1. Angular and NDDI Features Extraction

Considering

S N (t) = {[p_{1} p_{2}, \dots, p_{20}]}^{T}

as the 3D positions vector of skeletal nodes extracted from the depth frame at moment

t

, where

p_{i}

is the 3D position of the

i

th node (Figure 1), the angular feature and Normalized Difference Distance Index at this moment are calculated through Equations (1) and (2), respectively.

α_{t} (p_{i}, p_{v}, p_{j}) = \arccos (\frac{d^{2} (p_{i}, p_{v}) + d^{2} (p_{j}, p_{v}) - d^{2} (p_{i}, p_{j})}{2 d (p_{i}, p_{v}) d (p_{j}, p_{v})})

(1)

{N D D I}_{t} (p_{i}, p_{j}, p_{m}, p_{n}) = \frac{d (p_{i}, p_{j}) - d (p_{m}, p_{n})}{d (p_{i}, p_{j}) + d (p_{m}, p_{n})}

(2)

where

α_{t} (p_{i}, p_{v}, p_{j})

is the famous cosine formula for a triangle, and

d (p_{i}, p_{j})

represents the Euclidean distance between two 3D nodes,

i

and

j

. In Equation (1), parameter v represents the vertex point of the angle. The angular features range from 0 to 180 degrees, and the NDDIs’ range is

[- 1,1]

.

In this research, 36 features are extracted at each moment from the sequence of skeletal nodes, as shown in Table 1. Our goal was to design these features so that they could effectively be used to distinguish among the exercises. The numbering of the skeletal nodes in Table 1 corresponds to Figure 1. After generating the temporal features

f_{1} (t)

to

f_{36} (t)

as per Table 1, the next step involves quantifying all their pairwise relationships, detailed in the following subsection.

All the features presented in Table 1 can simply be visualized in Figure 1 by drawing the connections based on the node numbers. In general, the combinations of constructing angular features and NDDIs were selected based on the visual analysis of the trajectory of the skeletal nodes during exercise movement executions.

2.2. Generating the CC and MI Descriptors

Considering the time interval

[t_{1}, t_{2}]

and the time interval between successive frames (

d t

) being equal to the frames per second (

f p s

) of the depth video, a vector, as shown in Equation (3) for

i = 1, \dots, 36

, can be generated for each of the features listed in Table 1.

F_{i} (t_{1}, t_{2}) = {[f_{i} (t_{1}), f_{i} (t_{1} + d t), f_{i} (t_{1} + 2 d t), \dots, f_{i} (t_{2} - 2 d t), f_{i} (t_{2} - d t), f_{i} (t_{2})]}^{T}

(3)

As seen in Equation (3), the number of extracted components for each of the temporal features listed in Table 1 is a function of the length of the sequence (

Δ t = t_{2} - t_{1}

) and the

f p s

of the depth videos. Therefore, alterations in the duration of the 3D skeleton sequence can influence both the length and the signature recorded within each feature vector. In the third step of the proposed method, we tried to produce descriptors with less dependence on the feature vectors’ length by quantifying their pairwise relative relationship. The advantage of this approach lies in its ability to mitigate the effects of temporal shifts and drift during the matching process between sequences of 3D skeleton motions. We employed the well-known cross-correlation (CC) coefficients, as outlined in Equation (4), and the mutual information (MI) index, detailed in Equation (5), to compute relative pairwise descriptors among the feature vectors, as described in Equation (3).

C C (F_{i}, F_{j}) = \frac{σ_{F_{i}, F_{j}}}{σ_{F_{i}} σ_{F_{j}}} i, j = 1,2, \dots, 36 and i \neq j

(4)

M I (F_{i}, F_{j}) = E (F_{i}) + E (F_{j}) - J E (F_{i}, F_{j}) i, j = 1,2, \dots, 36 and i \neq j

(5)

In Equation (4),

σ_{F_{i}, F_{j}}

is the covariance between the feature vectors F_i and F_j, and

σ_{F_{i}}

and

σ_{F_{j}}

are their standard deviations [41]. In Equation (5),

E

and

J E

are the well-known entropy and joint entropy operators, respectively [42]. In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the amount of information obtained about one random variable by observing the other random variable.

Given the 36 features extracted in Table 1, a total of

630 + 630 = 1260

MI and CC descriptors could be generated for each sequence of 3D skeletal nodes, independent from of the length of its feature vectors. In the estimation of the MI descriptors, the known range of the features is quantized to increase the computational speed. To do so, the known range of angular indices is quantized into 10-degree intervals, and the range of NDDI is divided into twenty parts.

2.3. MI Descriptor Selection for Each Exercise Motion

In the proposed method, for each physical exercise, a separate set of discriminating MI descriptors is selected. The descriptor selection strategy is designed in a way that finds the descriptors that can discriminate each exercise motion from the others.

Given

M I D_{i, j} = {[M I_{i, j, 1}, M I_{i, j, 2}, \dots, M I_{i, j, 630}]}^{T}

as the descriptor vector of the

j

th training sample of the

i

th physical exercise, the discriminative potential of each of its components to separate the

i

th exercise can be quantified by Equation (6).

{D I}_{i, k} = \frac{|{\bar{M I}}_{i, *, k} - {\bar{M I}}_{\neq i, *, k}|}{S T D ({M I}_{i, *, k}) + S T D ({M I}_{\neq i, *, k})} k = 1,2, \dots, 630

(6)

where

{\bar{M I}}_{i, *, k}

and

S T D ({M I}_{i, *, k})

are the average and standard deviation of the

k

th component of the MID vectors for all the training examples of the

i

th, respectively. Also,

{\bar{M I}}_{\neq i, *, k}

and

S T D ({M I}_{\neq i, *, k})

are the average and standard deviation, respectively, of the

k

th component of the MID vectors for all training samples except the

i

th exercise. The symbol

|\cdot|

in this equation is the absolute operator. According to the Equation (6), when the

{\bar{M I}}_{i, *, k}

of the

i

th exercise has had a large difference in

{\bar{M I}}_{\neq i, *, k}

of the other exercise motions, the values of the DI index will be large. In this equation, the STD values act as emphasizers for the selection of descriptors with lower deviations.

The

D I

is calculated for all 360 components of the

M I D

vector for each physical exercise, and then, they are sorted (prioritized) in descending order. In the following, if there are no significant changes in prioritization, an attempt is made to refine the order of the indicators by considering the periodicity in the sign of the

D I

numerator. This is done to reduce the selection of descriptors with similar discrimination behavior. After possible refinement in the prioritization of descriptors, the components corresponding to a certain number (nD) of

D I

s with higher priority among the

M I D

vector components are selected as the set of distinguishing descriptors for each exercise motion.

2.4. Estimation of the Adaptive CEM Kernels

Constrained Energy Minimization is a well-known target detection method in which identifying a detector kernel vector (

\vec{H_{q}}

) for each target quantifies the possible presence of the target in an unknown descriptors vector

\vec{D_{q}} (t_{1}, t_{2})

by its response that is estimated through a simple dot product, as represented in Equation (7). Here, the targets will be exercising motions. In the CEM method,

\vec{H_{q}}

is estimated in such a way that the sum of the squares of the detector responses is minimized for other descriptor vectors, except the target [42].

{T R}_{q} (t_{i}, t_{j}) = \vec{H_{q}} (t_{i}, t_{j}) \cdot \vec{D_{q}} (t_{i}, t_{j}), t_{j} > t_{i}

(7)

In Equation (7),

\vec{H_{q}} (t_{i}, t_{j})

represents the detector kernel vector associated with the target (

q

) over the time interval

Δ t = [t_{i}, t_{j}]

within the video footage of 3D skeletal nodes. Similarly,

\vec{D_{q}} (t_{i}, t_{j})

and

{T R}_{q} (t_{i}, t_{j})

denote the vector of selected descriptors and the detector response corresponding to the target (

q

) during the same interval, respectively. Additionally, the notation “

\cdot

” signifies the dot product operation.

According to Equation (6), in the proposed method, for each unlabeled skeletal 3D nodes sequence between

[t_{i}, t_{j}]

and each physical exercise (target), a detector kernel is adaptively identified according to the corresponding target training samples and the elapsed sequences

[t_{0}, t_{j}]

. To do so, selected descriptors for each target of all the training samples and the elapsed sequences will be needed. Given the set

K_{q} = \{k_{q, 1}, k_{q, 2}, \dots, k_{q, n D}\}

containing the index numbers of the selected

M I

descriptors related to exercise

q

,

{\vec{T S D}}_{i, j, k \in K_{q}} = {[{M I}_{i, j, k_{q, 1}}, {M I}_{i, j, k_{q, 2}}, \dots, {M I}_{i, j, k_{q, n D}}]}^{T}

is a vector containing the selected

M I

s of the

q

th exercise for the

j

th training sample of the

i

th exercise. On the other hand,

\vec{D_{q}} (t_{i}, t_{j}) = {[{M I}_{k_{q, 1}} (t_{i}, t_{j}), {M I}_{k_{q, 2}} (t_{i}, t_{j}), \dots, {M I}_{k_{q, n D}} (t_{i}, t_{j})]}^{T}

is also the vector of the selected

M I

s for the

q

th physical exercises in the time interval

[t_{i}, t_{j}]

from the sequence of unlabeled 3D skeletal nodes. Accordingly, the adaptive detector kernel of the physical exercise

q

will be estimated through Equation (8) in the CEM method.

\vec{H_{q}} (t_{i}, t_{j}) = \frac{R_{q}^{- 1} (t_{i}, t_{j}) \times g_{q}}{g_{q}^{T} {\times R}_{q}^{- 1} (t_{i}, t_{j}) {\times g}_{q}}

(8)

In Equation (8),

R_{q} (t_{i}, t_{j})

is the adaptive covariance matrix of the training samples and the elapsed sequences for the

q

th target selected descriptors (Equation (9)). This matrix was designed to simultaneously be affected by both the training samples and the elapsed depth videos as the background context. Vector

g_{q}

is the representative vector of target

q

containing the mean values of its training samples of the selected descriptors (Equation (10)).

\begin{matrix} R_{q} (t_{i}, t_{j}) = \frac{1}{n T} \sum_{i = 1}^{n T} \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} {\vec{T S D}}_{i, j, k \in K_{q}} \times {\vec{T S D}}^{T}_{i, j, k \in K_{q}} \\ + \frac{d t}{t_{i}} \int_{t = 0}^{t_{i}} \vec{D_{q}} (t, t + ∆ t) \times {\vec{D_{q}} (t, t + ∆ t)}^{T} d t \end{matrix}

(9)

g_{q} = \frac{1}{n_{q}} \sum_{j = 1}^{n_{q}} {\vec{T S D}}_{q, j, k \in K_{q}}

(10)

In Equations (9) and (10),

n T

is the total number of physical exercises (targets), and

n_{q}

is the number of training samples of the

q

th exercise. Variable

∆ t

is the time interval for separating the sequence of 3D skeleton nodes in the process of labeling the video obtained from physical exercises. Also,

d t

is the sliding step of the time interval for the successive generation of vectors

\vec{D_{q}} (t_{i}, t_{j})

to be used in the process of labeling and scoring. The minimum value for

d t

can be set equal to the fps of the depth videos taken by the Kinect sensor.

From a theoretical point of view, the covariance matrix in the CEM method,

R_{q} (t_{i}, t_{j})

, should represent the statistical distribution of the background data [43,44]. Two different types of background motions might be present in the recorded depth videos when attempting to recognize a target exercise: irrelevant targets or non-target motions. Hence, the covariance matrix of the CEM kernels was designed to have a balanced effect from both types of backgrounds. Accordingly, the first nominal of Equation (9) is the target affected part of the covariance matrix, which is simply reconstructed by selected descriptors of the training samples. The second nominal of Equation (9) is the adaptive part of

R_{q} (t_{i}, t_{j}),

which reconstructs the statistical characteristics of the elapsed exercise motions (non-target backgrounds), before investigating

\vec{D_{q}} (t_{i}, t_{j})

. This will cause a better representation of the samples’ distribution in the descriptor space and reduce the possibility of the singularity in the

R_{q}^{- 1} (t_{i}, t_{j})

estimation when finding detector kernels.

2.5. Labeling of Unknown Sequences

After selecting the discriminating descriptors and identifying the CEM detector kernels for every physical exercise, as outlined in the flowchart presented in Figure 2, the procedure for labeling unknown sequences will commence, following the solid arrows depicted. This procedure can be easily executed on a long sequence of depth videos containing extracted 3D skeletal nodes. To achieve this, the entire sequence of 3D skeleton nodes must be divided into recognized ∆t intervals using a sliding temporal window. Following this, by extracting the features listed in Table 1 for each frame, the selected descriptors for each targeted exercise will be produced in the form of

\vec{D_{q}} (t_{i}, t_{j})

(

q = 1,2, \dots, n T

) vectors. Then, the response vector of the CEM detector kernels, Equation (6), is calculated as represented in Equation (11).

T R (t_{i}, t_{j}) = {[{T R}_{1} (t_{i}, t_{j}), {T R}_{2} (t_{i}, t_{j}), \dots, {T R}_{n T} (t_{i}, t_{j})]}^{T}

(11)

In the proposed method, the labeling of each sequence is performed through the simultaneous provision of two criteria. In this process, the initial label of each sequence

L (t_{i}, t_{j})

is identified through the maximum value of the

T R (t_{i}, t_{j})

vector; see Equation (12).

L (t_{i}, t_{j}) = a r g m a x (T R (t_{i}, t_{j})) L \in \{1,2, \dots, n T\}

(12)

Then, the initial labeling should be confirmed by comparing the CC descriptors as the second criterion. To accomplish this, the Pearson cross-correlation between the CC descriptor vectors of the initially labeled sequences and the training data representing the winning label must exceed 50%.

2.6. Scoring on the Performance of Exercises

By finding the label of each sequence, its performance quality is scored by comparing the CC descriptors to their corresponding training samples. However, due to the possible inconsistency between semantic scoring assigned by the exercise referees and similarity metrics between CC descriptors, finding a conversion between the two will be necessary. The process of scoring requires both labeled and scored training samples for each class. Each labeled sequence is scored by its descriptor vector’s location among the corresponding descriptors of the scored training samples. Since the imperfect executions of an exercise motion can alter its descriptor vectors’ values, some of the training samples should be selected from the imperfect executions. The concept behind the scoring process in our method is simply visualized in Figure 3. We have shown the descriptors space (630D) in a 3D space for simplicity in comprehension.

The stars in Figure 3 schematically represent the position of the vectors containing CC descriptors of the training samples related to the winning label of an exercise motion sequence. The red star is related to the position of the training sample with a perfect performance (score of 100), and the blue stars show the training samples with imperfect executions (scores less than 100). Here, the score of each star is displayed with

S_{b} = 100

and

S_{w_{1}}

,

S_{w_{2}}

, and

S_{w_{3}}

. As seen, three imperfect training samples have been drawn in Figure 3; however, the number of which can be different according to the available training samples in the scoring process. The yellow cross is the position of an exercise descriptors vector with the same label of drawn stars which score is unknown. The perpendicular distance of the yellow cross to the lines connecting each imperfect training sample to the perfect one is displayed with values

v_{1}

to

v_{3}

. During the scoring process, the line with the smallest perpendicular distance to the yellow cross is chosen as the reference for scoring. According to Figure 3, two distances: “

d

” (the distance between the vertical projection of the yellow cross and the red star position) and “

D

” (the distance between the red star and the associated blue one from the selected line) will be used to calculate the score. The mentioned

d

(vertical projection of (a) the vector connecting the red star to the yellow cross on (b) the unit vector that is colinear with the vector connecting the winner line endpoints) is calculated by simply using the dot product. Therefore, it can be either a positive or negative sign. Equation (13) shows how to calculate the score for the example of Figure 3. This equation is designed based on a simple linear interpolation.

S c o r e = \{\begin{array}{l} 0 & d > \frac{D}{100 - S_{w_{3}}} \\ 100 & d < 0 \\ 100 - (d / D) (100 - S_{w_{3}}) & O t h e r w i s e \end{array}

(13)

In Equation (13),

S_{w_{3}}

is the score of the selected training sample with a score of less than 100 in Figure 3, and the line connected to it has the smallest distance to the yellow cross. Therefore, in other similar situations,

S_{w_{3}}

will be replaced by the selected training sample score. According to this figure, if the sign of “

d

” is negative, it shows the yellow cross lies before the red star (perfect execution of exercise), and its score is supposed to be 100.

3. Results and Discussion

The proposed method has been evaluated in two datasets. The first dataset was created by the authors, and the second one was the reference data used in [24]. The dataset produced by us included nine exercise motions, and each movement was performed 10 times, for 90 repetitions in total, in front of the Kinect sensor. Nine short videos are available in the Supplementary Materials that show the performance of each exercise movement.

In creating the first dataset, three experienced physical coaches (two men and a woman) were engaged to execute each exercise correctly, repeating each action twice. In total, eight participants took part in the data collection, and the details are presented in Table 2.

The total length of the depth video in the first dataset was 245 min, of which a total of 116 min was dedicated to target exercises, and at other times, unrelated motions were performed in front of the Kinect sensor. The Kinect sensor was mounted at a suitable distance and orientation when taking depth video from the players. These conditions included a straight-ahead and full-height vision of the player’s body in the captured depth videos. The general environmental conditions of recording depth images were also the same as the computer games, controlled by Kinect. Generally, the presence of intense light in the environment deteriorates the matching process in the Kinect depth sensor, and the environment and players must not be covered with light-absorbing (dark) materials.

The second dataset also consisted of 297 labeled distinct sequences of 3D skeletal nodes that were categorized into nine different exercise performances. The labeling performance of the proposed method has been evaluated at two levels. At first, by changing its adjusting parameters, their effects were investigated on the results of the labeling process. At the second level, the performance of the proposed method was compared with the method developed in [24], which dataset has been used in our evaluations. The adjusting parameters of the proposed method are introduced in Table 3.

According to Table 3, a total of 100 modes of implementation of the proposed method with different adjusting parameters were designed to evaluate their effects on the labeling process. In each set of adjusting parameters, the accuracy of the labeling process was measured by indices of the overall accuracy and the Kappa coefficient extracted from the confusion matrix. To do so, the true labels of the physical exercises were assigned throughout the whole dataset and then compared with the predicted labels by the proposed method. The graphs presented in Figure 4 show the overall accuracy and Kappa coefficients of all 100 different sets of adjusting parameters in implementing the proposed method.

According to Figure 4, up to 30% differences in overall accuracies were observed when changing the adjusting parameters, which shows the importance of finding the best setting to achieve optimal results. Another inferable result in these graphs is the small sensitivity of the results to the change in the sampling rate of the depth videos (

f p s

), as if good labeling results can be achieved even with 4

f p s

videos. However, increasing the length of the video sequence used in the labeling process has led to a decreasing trend in overall accuracies. It seems the limited temporal length of each exercise performance and the possibility of their non-successive repeating in the recorded depth videos are the reasons for the decreasing trend of accuracies by increasing the length of the sequences. However, the sequences with lengths of 5 to 10 s did not have a significant change in decreasing the accuracies. Therefore, choosing such lengths for the sequences is recommended in the labeling process. Based on the obtained results, the other inferred advantage of the proposed method could be the low number of descriptors required for recognition of each physical exercise. As shown in Figure 4, the best results have been obtained with a minimal selection of descriptors (e.g., two descriptors for each exercise). This can be one of the strengths of this method compared to the ones that use neural networks or many complex descriptors in the labeling process like [24,34]. Reducing the number of generated descriptors in the labeling process can directly impact the computational costs and facilitate the possibility of obtaining real-time solutions in low-cost processing units. From the functional transparency point of view, developing such a method with clear steps is more interpretable than the neural network approaches with the black box nature, irrespective of their pragmatic efficiencies. The confusion matrix of the best-achieved result of the proposed method in the first dataset is presented in Figure 5.

As can be seen in the confusion matrix reported in Figure 5, the main uncertainties were related to the unclassified class that has a large diversity. Some of the unclassified sequences were also related to the time interval that the sequence of skeletal nodes was not completely located at during the exercise movement execution.

Next, we compared the results obtained from the proposed method using the best setting parameters in the second dataset with the results of [24] implemented on the same dataset. In [24], two deep learned neural networks named FF-AAGCN and DD-Net were used to label the nine actions in the sequence of 3D skeletal nodes. The confusion matrices of the proposed method and the mentioned comparison methods are presented in Figure 6.

By comparing the confusion matrices in Figure 6, the competitive accuracies of the proposed method can be seen. Note that the number of training and test samples used to achieve these results in the proposed method was selected similar to the ones used in the comparison methods. These were 55% train and 45% test. This demonstrated that the proposed method, with its simple and comprehended steps, can do what a complex deep-learned neural network can.

Our next evaluation was dedicated to the scoring process. As described in Section 2.6, one CC descriptor vector related to the perfect execution of each class and at least one sample with imperfect execution (score less than 100) are needed in the scoring process. Therefore, the only adjusting parameter of the scoring process is the number of imperfectly executed training samples with scores of less than 100. In evaluating the scoring process, the role of this adjusting parameter has been measured.

After determining the label of each sequence of the 3D skeletal nodes, its score can be calculated through the training samples of the winner label. We could not find any reference dataset with coach scores for 3D skeletal sequences for cross-validation. Therefore, we exclusively utilized our carefully prepared dataset, in which each exercise performance had been scored by an expert coach. The scoring results of the proposed method with the coach grades could be easily compared. Note that, in this evaluation, only the sequences were used that were completely in the time frame of doing the exercise motions. In other words, if a sequence contained the execution of more than one exercise, it was considered a mixed sequence that was not used in the evaluation process of scoring performance. This was done to eliminate these mixed sequences which CC descriptors were not purely dedicated to a unique exercise motion. Obviously, scoring the mixed sequences will be ambiguous, and we do not have any truth values for their comparisons. The scatterplots represented in Figure 7 show the predicted scores compared to their correct values for the pure sequences of our dataset. Each scatterplot represents the results of the scoring process with a different number of imperfect training samples.

As seen in the scatterplots of Figure 7, increasing the number of imperfect training samples with scores less than 100 can be effective in improving the accuracy of the scoring process. However, there is no significant difference between the selection of three and five imperfect training samples. Therefore, it seems that the availability of at least three training samples with scores less than 100 can achieve favorable results in the scoring process. The slopes of the fitted line (near 1) and the R² to each scatterplot indicate the low level of remaining bias in the scoring step. The Root Mean Square Errors (RMSEs) of the scatterplots represented in Figure 7 are also reported in Table 4.

In general, the proposed scoring method is a mapping of measurable quantities from the descriptive vectors of the sports coach’s linguistic scoring. The separation of the results reported in the graphs of Figure 7 into factors, such as the gender, height of the players, and the type of exercise movement, did not lead to meaningful and comparable results. This shows that, if the correct label is assigned to an exercise movement, the scoring process can be matched with the points obtained by a sports coach independently from these factors. The authors believe that the extracted features (angular and NDDI features) may have the potential to overcome these types of uncertainty due to their normalizing nature.

4. Conclusions

We proposed a method for continuously classifying and scoring the captured videos of physical exercises through the Kinect depth sensor. Our method was based on extracting relative descriptors between features extracted from 3D skeletal nodes. The labeling of video sequences was performed using adaptive CEM kernels and scoring through interpolation among CC descriptors. The results demonstrated that our method could achieve the proper accuracies in the labeling process, and its scoring results were close to ones derived by sports coaches. This idea can be used to develop an interactive computer game for game consoles with the aim of teaching dance or physical exercises. Essential changes that should be considered in this method when there is more than one depth sensor in capturing 3D skeletal nodes could also be considered in future investigations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s24206713/s1, Video S1: Performances of the physical exercises of 1 to 9 in the first dataset.

Author Contributions

Conceptualization, R.E., M.J.V.Z., A.S. and E.G.; methodology, R.E. and A.S.; software, R.E. and A.S.; validation, M.J.V.Z. and A.S.; formal analysis, R.E. and A.S.; investigation, R.E. and A.S.; data curation, R.E.; writing—original draft preparation, A.S.; writing—review and editing, A.S., M.J.V.Z. and E.G.; visualization, R.E. and A.S.; supervision, M.J.V.Z., A.S. and E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was received from all participants involved in this study.

Data Availability Statement

The datasets used in this research will be available upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Pang, X.; Liang, B. Motion recognition based on Kinect for human-computer intelligent interaction. J. Phys. Conf. Ser. 2019, 1187, 032028. [Google Scholar] [CrossRef]
Marín-Lora, C.; Chover, M.; Martín, M.Y.; García-Rytman, L. Creating a treadmill running video game with smartwatch interaction. Multimed. Tools Appl. 2024, 83, 57709–57729. [Google Scholar] [CrossRef]
Zhang, M.; Zhang, Z.; Chang, Y.; Aziz, E.-S.; Esche, S.; Chassapis, C. Recent Developments in Game-Based Virtual Reality Educational Laboratories Using the Microsoft Kinect. Int. J. Emerg. Technol. Learn. 2018, 13, 138–159. [Google Scholar] [CrossRef]
Arena, F.; Collotta, M.; Pau, G.; Termine, F. An overview of augmented reality. Computers 2022, 11, 28. [Google Scholar] [CrossRef]
Yue, H.; Chen, W.; Wu, X.; Liu, J. Fast 3D modeling in complex environments using a single Kinect sensor. Opt. Lasers Eng. 2014, 53, 104–111. [Google Scholar] [CrossRef]
Xu, M.; Zhai, Y.; Guo, Y.; Lv, P.; Li, Y.; Wang, M.; Zhou, B. Personalized training through Kinect-based games for physical education. J. Vis. Commun. Image Represent. 2019, 62, 394–401. [Google Scholar] [CrossRef]
Tipton, C.C.; Telfer, S.; Cherones, A.; Gee, A.O.; Kweon, C.Y. The use of Microsoft Kinect ™ for assessing readiness of return to sport and injury risk exercises: A validation study. Int. J. Sports Phys. Ther. 2019, 14, 724–730. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
Limbu, B.H.; Jarodzka, H.; Klemke, R.; Specht, M. Using sensors and augmented reality to train apprentices using recorded expert performance: A systematic literature review. Educ. Res. Rev. 2018, 25, 1–22. [Google Scholar] [CrossRef]
Hernández-Aceituno, J.; Arnay, R.; Toledo, J.; Acosta, L. Using Kinect on an Autonomous Vehicle for Outdoors Obstacle Detection. IEEE Sens. J. 2016, 16, 3603–3610. [Google Scholar] [CrossRef]
Suo, X.; Tang, W.; Li, Z. Motion Capture Technology in Sports Scenarios: A Survey. Sensors 2024, 24, 2947. [Google Scholar] [CrossRef]
Cossich, V.R.; Carlgren, D.; Holash, R.J.; Katz, L. Technological breakthroughs in sport: Current practice and future potential of artificial intelligence, virtual reality, augmented reality, and modern data visualization in performance analysis. Appl. Sci. 2023, 13, 12965. [Google Scholar] [CrossRef]
Rahman, S.; Sarker, S.; Haque, A.N.; Uttsha, M.M.; Islam, M.F.; Deb, S. AI-driven stroke rehabilitation systems and assessment: A systematic review. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 13, 192–207. [Google Scholar] [CrossRef]
Sasaki, K.; Masui, S.; Tezuka, K. 3D sensing technology for real-time quantification of athletes’ movements. Fujitsu Sci. Tech. J. 2018, 54, 8–16. [Google Scholar]
Li, J.; Gao, W.; Wu, Y.; Liu, Y.; Shen, Y. High-quality indoor scene 3D reconstruction with RGB-D cameras: A brief review. Comput. Vis. Media 2022, 8, 369–393. [Google Scholar] [CrossRef]
Lyu, Z. State-of-the-art human-computer-interaction in metaverse. Int. J. Hum. Comput. Interact. 2023, 1–19. [Google Scholar] [CrossRef]
Ain, Q.U.; Khan, S.; Ilyas, S.; Yaseen, A.; Tariq, I.; Liu, T.; Wang, J. Additional Effects of Xbox Kinect Training on Upper Limb Function in Chronic Stroke Patients: A Randomized Control Trial. Healthcare 2021, 9, 242. [Google Scholar] [CrossRef]
Sarbolandi, H.; Lefloch, D.; Kolb, A. Kinect range sensing: Structured-light versus Time-of-Flight Kinect. Comput. Vis. Image Underst. 2015, 139, 1–20. [Google Scholar] [CrossRef]
Tölgyessy, M.; Dekan, M.; Chovanec, Ľ. Skeleton tracking accuracy and precision evaluation of kinect v1, kinect v2, and the azure Kinect. Appl. Sci. 2021, 11, 5756. [Google Scholar] [CrossRef]
Newcombe, R.A.; Izadi, S.; Hiliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136. [Google Scholar] [CrossRef]
Du, R.; Turner, E.; Dzitsiuk, M.; Prasso, L.; Duarte, I.; Dourgarian, J.; Afonso, J.; Pascoal, J.; Gladstone, J.; Cruces, N. DepthLab: Real-time 3D interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, Virtual Event USA, 20–23 October 2020; pp. 829–843. [Google Scholar] [CrossRef]
Budakov, P. 3D-rendered images and their application in the interior design. In Encyclopedia of Computer Graphics and Games; Springer International Publishing: Cham, Switzerland, 2024; pp. 62–70. [Google Scholar] [CrossRef]
Adikari, S.B.; Ganegoda, N.C.; Meegama, R.G.; Wanniarachchi, I.L. Applicability of a Single Depth Sensor in Real-Time 3D Clothes Simulation: Augmented Reality Virtual Dressing Room Using Kinect Sensor. Adv. Hum. Comput. Interact. 2020, 2020, 1314598. [Google Scholar] [CrossRef]
Rallis, I.; Voulodimos, A.; Bakalos, N.; Protopapadakis, E.; Doulamis, N.; Doulamis, A. Machine learning for intangible cultural heritage: A review of techniques on dance analysis. In Visual Computing for Cultural Heritage; Springer: Cham, Switzerland, 2020; pp. 103–119. [Google Scholar] [CrossRef]
Pham, Q.T.; Nguyen, D.A.; Nguyen, T.T.; Nguyen, T.N.; Nguyen, D.T.; Pham, D.T.; Tran, T.H.; Le, T.-L.; Vu, H. A study on skeleton-based action recognition and its application to physical exercise recognition. In Proceedings of the 11th International Symposium on Information and Communication Technology, Association for Computing Machinery, Hanoi, Vietnam, 1–3 December 2022; pp. 239–246. [Google Scholar] [CrossRef]
Sideridou, M.; Kouidi, E.; Hatzitaki, V.; Chouvarda, I. Towards Automating Personal Exercise Assessment and Guidance with Affordable Mobile Technology. Sensors 2024, 24, 2037. [Google Scholar] [CrossRef]
Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar] [CrossRef]
Yeung, K.Y.; Kwok, T.H.; Wang, C.C.L. Improved Skeleton Tracking by Duplex Kinects: A Practical Approach for Real-Time Applications. J. Comput. Inf. Sci. Eng. 2013, 13, 041007. [Google Scholar] [CrossRef]
Cai, X.; Li, S.; Zhao, L.; Liu, Z.; Ye, Q. Efficient Skeleton Extraction Method Based on Depth Data in Infrared Self-help Camera System. In Advanced Multimedia and Ubiquitous Engineering; Park, J., Jin, H., Jeong, Y.S., Khan, M., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2016; Volume 393, pp. 601–608. [Google Scholar] [CrossRef]
Kuang, H.; Cai, S.; Ma, X.; Liu, X. An Effective Skeleton Extraction Method Based on Kinect Depth Image. In Proceedings of the 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), Changsha, China, 10–11 February 2018; pp. 187–190. [Google Scholar] [CrossRef]
Li, R.; Si, W.; Weinmann, M.; Klein, R. Constraint-Based Optimized Human Skeleton Extraction from Single-Depth Camera. Sensors 2019, 19, 2604. [Google Scholar] [CrossRef]
Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297. [Google Scholar] [CrossRef]
Zhao, X.; Li, X.; Pang, C.; Zhu, X.; Sheng, Q.Z. Online human gesture recognition from motion data streams. In Proceedings of the 21st ACM international conference on Multimedia (MM ‘13). Association for Computing Machinery, Barcelona, Spain, 21–25 October 2013; pp. 23–32. [Google Scholar] [CrossRef]
Wang, J.; Wu, Y. Learning Maximum Margin Temporal Warping for Action Recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 2688–2695. [Google Scholar] [CrossRef]
Kerola, T.; Inoue, N.; Shinoda, K. Cross-view human action recognition from depth maps using spectral graph sequences. Comput. Vis. Image Underst. 2017, 154, 108–126. [Google Scholar] [CrossRef]
Chiang, A.-T.; Chen, Q.; Wang, Y.; Fu, M.R. Kinect-Based In-Home Exercise System for Lymphatic Health and Lymphedema Intervention. IEEE J. Transl. Eng. Health Med. 2018, 6, 4100313. [Google Scholar] [CrossRef]
Salim, H.A.; Alaziz, M.; Abdalla, T.Y. Human Activity Recognition Using the Human Skeleton Provided by Kinect. Iraqi J. Electr. Electron. Eng. 2021, 17, 183–189. [Google Scholar] [CrossRef]
Li, C.; Xie, C.; Zhang, B.; Han, J.; Zhen, X.; Chen, J. Memory Attention Networks for Skeleton-Based Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4800–4814. [Google Scholar] [CrossRef]
Haghi, M.; Ershadi, A.; Deserno, T.M. Recognizing Human Activity of Daily Living Using a Flexible Wearable for 3D Spine Pose Tracking. Sensors 2023, 23, 2066. [Google Scholar] [CrossRef]
Ghaderpour, E.; Mazzanti, P.; Bozzano, F.; Scarascia Mugnozza, G. Trend Analysis of MODIS Land Surface Temperature and Land Cover in Central Italy. Land 2024, 13, 796. [Google Scholar] [CrossRef]
Murari, A.; Rossi, R.; Lungaroni, M.; Gaudio, P.; Gelfusa, M. Quantifying Total Influence between Variables with Information Theoretic and Machine Learning Techniques. Entropy 2020, 22, 141. [Google Scholar] [CrossRef]
Jae-Chern, Y.; Han, T.H. Fast normalized cross-correlation. Circuits Syst. Signal Process. 2009, 28, 819–843. [Google Scholar] [CrossRef]
Jiang, A.H.; Huang, X.C.; Zhang, Z.H.; Li, J.; Zhang, Z.Y.; Hua, H.X. Mutual information algorithms. Mech. Syst. Signal Process. 2010, 24, 2947–2960. [Google Scholar] [CrossRef]
Yang, X.; Zhao, M.; Shi, S.; Chen, J. Deep constrained energy minimization for hyperspectral target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8049–8063. [Google Scholar] [CrossRef]
Chang, C.-I. Constrained Energy Minimization (CEM) for Hyperspectral Target Detection: Theory and Generalizations. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5522921. [Google Scholar] [CrossRef]

Figure 1. Skeletal joint points extracted from a human depth image.

Figure 2. Flowchart of the proposed method for recognizing and scoring physical exercises.

Figure 3. A schematic drawing of the scoring process in the CC descriptors space.

Figure 4. Overall accuracy and Kappa coefficient of the labeling by changing the adjusting parameters.

Figure 5. Confusion matrix of the best result achieved by the proposed method in the first dataset.

Figure 6. The confusion matrix of the proposed method and the results obtained from [24] in the second dataset.

Figure 7. Scatterplots of the scoring process for different numbers of imperfect training samples.

Table 1. Angular features and NDDIs extracted from skeletal nodes.

$0^{\circ} < α_{t} (p_{i}, p_{v}, p_{j}) < 180^{\circ}$								$- 1 < N D D I_{t} (p_{i}, p_{j}, p_{m}, p_{n}) < 1$
$f$	$i$	$v$	$j$	$f$	$i$	$v$	j	$f$	$i$	$j$	$m$	$n$
$f_{1} (t)$	2	5	13	$f_{14} (t)$	1	16	4	$f_{27} (t)$	1	4	2	3
$f_{2} (t)$	1	6	14	$f_{15} (t)$	2	16	4	$f_{28} (t)$	1	2	4	3
$f_{3} (t)$	3	10	8	$f_{16} (t)$	1	16	3	$f_{29} (t)$	6	11	5	10
$f_{4} (t)$	4	11	9	$f_{17} (t)$	5	12	16	$f_{30} (t)$	6	5	11	10
$f_{5} (t)$	6	16	11	$f_{18} (t)$	6	12	16	$f_{31} (t)$	5	10	7	16
$f_{6} (t)$	5	16	10	$f_{19} (t)$	10	16	12	$f_{32} (t)$	6	11	7	16
$f_{7} (t)$	5	16	11	$f_{20} (t)$	11	16	12	$f_{33} (t)$	2	10	7	16
$f_{8} (t)$	6	16	10	$f_{21} (t)$	2	12	16	$f_{34} (t)$	1	11	7	16
$f_{9} (t)$	11	16	10	$f_{22} (t)$	1	12	16	$f_{35} (t)$	2	3	7	16
$f_{10} (t)$	6	12	5	$f_{23} (t)$	3	16	12	$f_{36} (t)$	1	4	7	16
$f_{11} (t)$	2	12	1	$f_{24} (t)$	4	16	12
$f_{12} (t)$	3	16	4	$f_{25} (t)$	12	5	1
$f_{13} (t)$	2	16	3	$f_{26} (t)$	12	6	2

Table 2. Specifications and roles of the participants involved in generating the first dataset.

Role	Gender	Age (Years)	Hight (cm)	Weight (kg)
Coach	Female	27	167	57
Participant	Female	35	163	65
Participant	Female	45	160	70
Participant	Male	19	180	75
Participant	Male	26	185	88
Coach	Male	30	178	68
Participant	Male	42	179	90
Participant	Male	60	173	80

Table 3. Adjusting parameters used for evaluation of the proposed method.

Parameters	Description	Set Values
$∆ t$	The sequence length	5, 7, 10, 15 and 20 s
$d t$	Depth video rate	30 fps, 10 fps, 6 fps and 4.3 fps
$n D$	Number of selected MI descriptors for each exercise	2, 4, 8, 12 and 20

Table 4. The RMSEs of the scoring step.

Number of Imperfect Samples	RMSEs (Points)
1	10.9
2	7.8
3	2.4
5	1.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Esmaeeli, R.; Valadan Zoej, M.J.; Safdarinezhad, A.; Ghaderpour, E. Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor. Sensors 2024, 24, 6713. https://doi.org/10.3390/s24206713

AMA Style

Esmaeeli R, Valadan Zoej MJ, Safdarinezhad A, Ghaderpour E. Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor. Sensors. 2024; 24(20):6713. https://doi.org/10.3390/s24206713

Chicago/Turabian Style

Esmaeeli, Raana, Mohammad Javad Valadan Zoej, Alireza Safdarinezhad, and Ebrahim Ghaderpour. 2024. "Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor" Sensors 24, no. 20: 6713. https://doi.org/10.3390/s24206713

APA Style

Esmaeeli, R., Valadan Zoej, M. J., Safdarinezhad, A., & Ghaderpour, E. (2024). Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor. Sensors, 24(20), 6713. https://doi.org/10.3390/s24206713

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor

Abstract

1. Introduction

2. Materials and Methods

2.1. Angular and NDDI Features Extraction

2.2. Generating the CC and MI Descriptors

2.3. MI Descriptor Selection for Each Exercise Motion

2.4. Estimation of the Adaptive CEM Kernels

2.5. Labeling of Unknown Sequences

2.6. Scoring on the Performance of Exercises

3. Results and Discussion

4. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI