Article

Analysis of Kinect-Based Human Motion Capture Accuracy Using Skeletal Cosine Similarity Metrics

by Wenchuan Jia, Hanyang Wang, Qi Chen, Tianxu Bao and Yi Sun *
School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(4), 1047; https://doi.org/10.3390/s25041047
Submission received: 27 December 2024 / Revised: 1 February 2025 / Accepted: 7 February 2025 / Published: 10 February 2025
(This article belongs to the Section Sensors and Robotics)

Abstract
Kinect, with its intrinsic and accessible human motion capture capabilities, has found widespread application in real-world scenarios such as rehabilitation therapy and robot control. Consequently, a thorough analysis of its previously under-examined motion capture accuracy is of paramount importance to mitigate the risks potentially arising from recognition errors in practical applications. This study employs a high-precision, marker-based motion capture system to generate ground truth human pose data, enabling an evaluation of Azure Kinect’s performance across a spectrum of tasks, which include both static postures and dynamic movement behaviors. Specifically, the cosine similarity for skeletal representation is employed to assess pose estimation accuracy from an application-centric perspective. Experimental results reveal that factors such as the subject’s distance and orientation relative to the Kinect, as well as self-occlusion, exert a significant influence on the fidelity of Azure Kinect’s human posture recognition. Optimal testing recommendations are derived based on the observed trends. Furthermore, a linear fitting analysis between the ground truth data and Azure Kinect’s output suggests the potential for performance optimization under specific conditions. This research provides valuable insights for the informed deployment of Kinect in applications demanding high-precision motion recognition.

1. Introduction

The continuous advancement of motion capture technology has propelled its application far beyond the realms of film production and game design, with notable adoption in domains such as medical rehabilitation, virtual reality, and robot control [1,2,3]. In particular, markerless vision-based motion capture devices, owing to their low hardware costs and ease of use [4], demonstrate a broader application potential compared to traditional marker-based systems [5]. Undoubtedly, Microsoft’s Azure Kinect DK series stands as a representative example of markerless motion capture devices [6], distinguished not only by its compact design and open software framework, but also by its superior image modulation frequency, depth data accuracy, and its inherent human skeletal joint recognition capabilities [7,8]. However, as is characteristic of markerless motion capture devices, the correctness and precision of its recognition are susceptible to factors such as illumination variations and occlusions in Kinect’s field of view.
In the film industry, motion capture data are typically acquired offline, allowing for the mitigation of data deviations through iterative acquisition or heuristic adjustments based on experience and intuition. In interactive gaming, motion capture data are processed in real-time, but inaccuracies often remain unnoticed unless they manifest as egregious errors. However, in real-world applications, such as medical rehabilitation, the implications of motion capture data deviations become critical and can no longer be disregarded, as they directly influence the effectiveness of the application and potentially introduce significant risks. For Kinect, as the applications increasingly diversify and become more practical, a rigorous evaluation of its accuracy becomes ever more paramount.
For instance, Amprimo et al. employed the Azure Kinect for the remote assessment, detection, and rehabilitation of Parkinson’s disease patients [9]. This same sensor has also been utilized for classifying and detecting individuals with depression [10] and for monitoring geriatric clinical conditions [11], collectively demonstrating its popularity in clinical healthcare. Bärligea et al. leveraged the Azure Kinect as a motion-tracking signal generator for weightlessness simulation, showcasing its potential in manned space mission planning and other aerospace applications [12]. By integrating the Kinect with other sensing devices and control techniques, its range of applications can be further expanded. Examples include combining it with robot models to achieve collision-free human–robot interaction [13], integrating it with IMUs for more stable human motion measurement [14], and introducing deep learning techniques to enable upper limb functional assessment using a single Kinect v2 sensor [15].
Various methods have been proposed to enhance Kinect’s motion capture accuracy. For instance, a Support Vector Machine (SVM) classifier was employed to improve the accuracy of human action recognition [16]. Park et al. designed a whole-body driven scanner consisting of three Kinect v2 sensors, mitigating the time cost and privacy concerns associated with wearing tight-fitting clothing in traditional 3D whole-body scanning while achieving high prediction accuracy [17].
Since the parallel use of Kinect sensors can enhance accuracy, various multi-sensor approaches have been investigated. One study proposed using two Kinect sensors to acquire upper limb joint angle data from different perspectives, resulting in significantly improved accuracy and robustness of joint angle trajectory recognition [18]. Another study combined data from multiple Azure Kinect sensors, transforming depth information into 3D positions to mitigate occlusion issues inherent in single-camera setups [19]. A data fusion algorithm for three Kinect devices was developed to enhance the accuracy of human skeletal tracking [20]. Furthermore, a spatiotemporal calibration method for multiple cameras was proposed to maximize coverage of the captured subject and minimize occlusions, leading to a substantial improvement in the accuracy of Azure Kinect motion capture [21].
However, considering the practical advantages of deploying a single Kinect and that using multiple Kinect sensors somewhat contradicts the original intent of its ease of use, there has been a growing body of research focusing on the motion capture performance of a single Kinect. Yeunga et al. compared the accuracy of gait tracking using Azure Kinect, Kinect v2, and Orbbec Astra across five camera viewpoints, demonstrating the superior performance of Azure Kinect in tracking hip and knee joints in the sagittal plane [22]. Bilesan discussed how to use inverse kinematics to more accurately convert spatial node data obtained from a single Kinect into joint angle data [23] and evaluated its effectiveness using a real robot [24]. Beshara, through clinical experiments based on multiple testing devices, demonstrated that Kinect exhibits high reliability in shoulder range of motion (ROM) measurements [25]. The depth and spatial accuracy of the Azure Kinect DK have also been compared with those of its predecessors, Kinect v1 and Kinect v2, further validating its advantages in 3D scanning applications [6,26]. Büker et al. evaluated the repeatability of the Azure Kinect, analyzing the spatiotemporal progression and differences in derived parameters from 100 body tracking experiments, revealing significant variations in joint positions under different processing modes and thus recommending careful selection of processing modes and consistent use of the same computer hardware for all analyses in practical applications [27].
Evidently, even the latest Azure Kinect DK device does not yet represent a sufficiently precise motion capture apparatus. Nevertheless, such markerless motion capture systems offer significant practical advantages and continue to improve [28,29,30,31,32,33,34]. For example, the studies of Martiš [29], Antico [32], and Milosevic [33] et al. demonstrate that markerless recognition technologies can significantly reduce assessment time in clinical medicine, making their advantages and potential in clinical settings undeniable when compared to traditional marker-based systems (e.g., Vicon) and methods. This compels us to engage in a deeper consideration of the following questions:
(1) What is the precise accuracy of full-body human motion capture using a single Kinect device? Specifically, how can we design a suitable motion capture data accuracy evaluation scheme that considers its potential practical application requirements, thereby providing reference accuracy data and potential guidelines for its application development? It is pertinent to note that existing accuracy analyses and evaluation methods are primarily categorized into two types. The first stems from image analysis techniques, generally emphasizing the spatial position of joint nodes rather than the morphology of the skeletal system itself. The second directly compares joint rotation angles [23,34], while neglecting positional accuracy. While we acknowledge that discussions regarding the fidelity of joint rotation angle tracking have become increasingly important with the advancement of research on human-like motion, the accuracy of joint position remains crucial, particularly in tasks requiring spatial positioning. Therefore, devising a new and appropriate accuracy evaluation method that represents a compromise between these two types of accuracy (joint angle and joint position) is a worthwhile and practical research direction.
(2) Under the current technological constraints of a single Azure Kinect DK device, what is the potential for subsequent improvement and enhancement of its motion capture data? After all, existing accuracy discussions primarily focus on its previous versions, such as Kinect v2.
We contend that these two questions are pivotal for the deeper application of Kinect in the future. However, to the best of our knowledge, there is currently a paucity of comprehensive research in this area; despite their fundamental importance, most studies continue to focus on showcasing the application cases and potential of Kinect in a wider range of fields. Therefore, this paper addresses these two questions to serve as a valuable supplement to the existing research on the accuracy of Kinect.
In this work, we adopt the concept of cosine similarity between vectors to analyze and evaluate the spatial directional consistency between skeletal recognition results and their ground truth. Since this metric is directly related to the spatial position accuracy of the joints and can be further correlated with the accuracy of joint angles, it combines the advantages of both types of accuracy analysis metrics. Based on a cosine similarity analysis of both local body segments and the entire skeleton, our findings reveal that Azure Kinect DK achieves high skeletal recognition accuracy for stationary postures, with a distinguishable decline in accuracy during motion. When the distance and orientation between the human body and the Kinect camera are maintained within a certain range, the Kinect demonstrates good recognition accuracy and stability, while, as these parameters deviate from this range, the stability of recognition decreases significantly, and the accuracy consequently declines. For individual skeletal segments, the accuracy of distal limb segments is generally lower than that of their proximal counterparts, although exceptions occur under specific conditions, which are analyzed in detail in this study. Furthermore, attempts to post-process and correct motion capture data based on the ground truth data suggest that the accuracy of Azure Kinect DK’s motion capture can be further improved under certain conditions. These novel experimental findings provide practical references and recommendations for the in-depth application of this popular device in tasks requiring high-precision motion recognition.

2. Materials and Methods

2.1. Experimental Setup

Two identical Azure Kinect DK devices (hereinafter referred to as Kinect), model 1880, with color camera firmware version 1.6.11 and depth camera firmware version 1.6.79, were selected for the evaluation tests. It is important to clarify that these devices were not utilized concurrently; rather, one served as a control to assess the generalizability of the acquired data.
Furthermore, considering that marker-based motion capture systems are capable of acquiring high-precision spatial positioning data, a Visualeyez system manufactured by Phoenix Technologies was employed to accurately acquire the spatial pose data of markers and subsequently generate human motion pose data that could serve as ground truth. This device, specifically the Visualeyez III VZ10K PTI 3D Motion Capture system (hereinafter referred to as Visualeyez), is an active marker-based tracking system [35,36]. It features a maximum sampling frequency of 10,000 Hz and a sampling accuracy of 0.1 mm. With a spatial resolution of 15 μm, a square field of view (FOV) of 100°, and a latency of 0.3 ms, it is capable of supporting high-precision dynamic capture [37].
In practical tests, the coordinate systems of different devices need to be aligned to facilitate the analysis of motion capture data. As illustrated in Figure 1a, the origin of the default coordinate system for Kinect is defined at the center point of its depth camera [38], with the depth direction as the positive z-axis, the positive x-axis pointing to the right, and the positive y-axis pointing downward. The origin of the data coordinate system for the Visualeyez device is located at the center of its central capture lens [35], with the depth direction as the positive z-axis, the positive x-axis pointing upward, and the positive y-axis pointing to the right. During the testing process, the z-axes of both devices were positioned in the same vertical plane and maintained parallel to each other, and the YOZ plane of the Kinect devices and the XOZ plane of Visualeyez were made coplanar.
In the experiments conducted in this study, both Kinect and Visualeyez were used to measure the subjects simultaneously. The acquired data were first transformed into a unified coordinate system. Subsequently, the data acquired from Visualeyez were used as a benchmark against which the data obtained from Kinect were compared and analyzed.

2.2. Experimental Scenarios

To comprehensively evaluate the accuracy of human pose data identified by the Kinect, we categorized the testing tasks into static and dynamic scenarios from an application perspective. In the static tests, a mannequin with a height of 1.85 m was selected as the sole subject. This mannequin closely approximates the full-body morphology of a human and can maintain a static posture, as visually compared to a real human in Figure 1b. Consequently, static tests enable the repeatable and precise reproduction of the human pose to be measured, facilitating the direct assessment of discrepancies between motion capture data and ground truth data. This setup was also used to investigate how factors such as the subject’s position, orientation, and posture affect the motion capture accuracy and stability. In the dynamic tests, human subjects performed a variety of pre-defined actions in sequence to evaluate the overall accuracy of the Kinect in continuous motion capture.

2.2.1. Static Test Protocol

The specific tasks and scenarios for the static tests are illustrated in Figure 1c. In these tests, the mannequin was configured in a stable, upright standing posture, further categorized into three distinct poses based on upper limb configuration: standing at attention, arms swinging anteroposteriorly, and arms outstretched laterally. These poses were selected to represent common and typical human postures.
The relative positioning between the subject and the Kinect was limited to the most prevalent application scenario: the subject positioned directly in front of the Kinect. Considering the recognition range of Kinect, the subject was maintained within a distance range of 1500 mm to 4000 mm directly in front of the device. Distances less than 1500 mm could result in the hands exceeding the effective recognition range. Within this range, static tests with the subject directly facing the Kinect were conducted at 100 mm intervals. For tests where the subject was not directly facing the Kinect, intervals of 500 mm were used. In these instances, although the subject’s position remained directly in front of the Kinect’s lens, their body orientation maintained a fixed angular deviation from the Kinect’s negative z-axis.
For scenarios where the subject’s orientation deviated from the Kinect’s negative z-axis, specific yaw angles of ±15°, ±30°, and ±45° were implemented.
The aforementioned subject-to-device distance refers specifically to the separation along the device’s z-axis. The device’s position is defined as the origin of its coordinate system. The subject’s position is defined by the toes, the point of the body that protrudes furthest from the vertical axis through the body’s center. Therefore, when the subject stands directly in front of the device, facing it, the z-axis distance between the toes and the device constitutes the measured distance. By rotating the subject in place, the facing direction can be adjusted while maintaining this distance. This distance was measured using a portable laser rangefinder during the experiments.
Taking into account the aforementioned factors, the specific tasks for the static tests are detailed in Table 1.

2.2.2. Dynamic Test Protocol

The specific tasks and scenarios for the dynamic tests are illustrated in Figure 1d. In these experiments, multiple human subjects, as opposed to the mannequin, participated directly to perform pre-defined actions. We did not investigate scenarios involving multiple subjects simultaneously; rather, similar to the static tests, each experiment involved only one subject present in the scene.
The pre-defined actions comprised four types: marching in place, forward and backward walking, lateral walking, and in-place body rotation. During walking, neither the speed of body movement nor the arm movements were strictly controlled (though natural arm swinging was encouraged); subjects were instructed to act according to their personal habits, provided the task duration was satisfied. Each type of movement had only a specific goal for the overall body trajectory. For example, forward and backward walking required the subject to begin at a distance of 3500 mm from the Kinect, proceed forward to 1500 mm, and then retreat backward to 3500 mm. Marching in place and lateral walking tests were conducted at a series of pre-defined initial positions, with a z-axis interval of 500 mm between these positions. The specific tasks for the dynamic tests are detailed in Table 2.
The diversification of test movements was designed according to the principle of including common types of motion while maintaining the complexity of the movements within certain limits. This facilitates the comparison and evaluation of the Kinect’s performance in capturing dynamic motion.

2.3. Description and Visualization of Human Pose Data

2.3.1. Selection of Joint Nodes

The Azure Kinect DK device inherently represents the human skeleton using 32 predefined joint nodes. From these, we selected 16 nodes located on the torso and limbs for pose estimation, aiming to retain the maximum amount of human posture data while discarding unnecessary test data. For instance, nodes located on facial features and hands were excluded from our evaluation. Similarly, the clavicle node was omitted due to its proximity to the neck node. The toe node was also excluded due to its demonstrably low recognition accuracy. The final selection of 16 nodes and their corresponding experimental numbering is illustrated in Figure 2.
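For readers reproducing this node selection programmatically, the short sketch below filters a body frame down to the 16 evaluated nodes. The node names and their 1–16 ordering follow Appendix A Table A1; the dictionary-based representation of a body frame is an assumption made purely for illustration and is not the Body Tracking SDK’s native data structure.

```python
# Illustrative selection of the 16 torso and limb nodes used in this study from
# the 32 joints reported by the Azure Kinect Body Tracking SDK. Facial, hand,
# clavicle, and toe joints are discarded, as described above. The node names
# and their 1-16 ordering follow Appendix A Table A1.
SELECTED_NODES = [
    "Ankle_Right", "Knee_Right", "Ankle_Left", "Knee_Left",
    "Pelvis", "Spine_Naval", "Hip_Right", "Hip_Left",
    "Wrist_Left", "Elbow_Left", "Wrist_Right", "Elbow_Right",
    "Shoulder_Right", "Neck", "Spine_Chest", "Shoulder_Left",
]

def select_nodes(all_joints):
    """Keep only the 16 evaluated nodes.

    `all_joints` maps joint names to (x, y, z) positions in millimetres,
    e.g. as parsed from a body-tracking frame (assumed representation).
    """
    return {name: all_joints[name] for name in SELECTED_NODES}
```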

2.3.2. Ensuring Data Quality of Skeletal Nodes

The spatial positions of skeletal nodes serve as the foundational data for all our analyses. To ensure data quality, a multitude of detailed measures were implemented during the experiments, including:
  • Stable testing conditions: The experiments were conducted in an enclosed indoor space with minimal natural light, primarily illuminated by indoor light sources. These sources consisted of LED lights uniformly distributed across the ceiling. The walls were painted white, the floor was wood-textured, and the primary testing area featured adhesive floor markers. The lighting remained constant throughout all experiments. This setup ensured consistent diffuse lighting conditions, similar to typical well-lit indoor environments.
  • Specific design of experimental apparatus: In the experimental area, a thin, graduated carpet was fixed to the floor in front of Kinect to visually indicate the distance between any given position and the Kinect. The mannequin used in static experiments was mounted on a mobile stand equipped with counterweights, allowing for convenient and precise adjustments to its position and orientation, thereby enabling corresponding adjustments to the mannequin subject. Curtains were also employed to occlude external light sources that could potentially compromise the measurement accuracy of Visualeyez.
  • Experimental data extraction: The data acquisition frequency of the Kinect was set to 30 Hz, while that of the Visualeyez was set to 60 Hz. In static test tasks, the Kinect continuously acquired 500 frames of valid data, which were directly averaged. The Visualeyez, on the other hand, continuously acquired over 1000 frames of data, from which, after preliminary screening and filtering, 100 high-quality frames were selected for averaging. In dynamic test tasks, if the Visualeyez software momentarily failed to effectively detect a visual marker due to occlusion or other factors, the last valid data point corresponding to that marker was used as the current data record. This approach aimed to maximize the integrity of data during dynamic testing. Through this processing, the data for each skeletal node obtained from both devices were refined to minimize the influence of environmental disturbances, such as random infrared interference and occlusion.
  • Synchronization of experimental data: Compared to static tests, data acquired from the two devices in dynamic tests required further temporal synchronization. To ensure low-latency data processing, the software systems of the two testing devices were operated on separate computer systems, both of which exhibited startup delays. This presented challenges for strict hardware-based synchronization. Consequently, an offline data synchronization method based on timestamps and pose comparison was adopted. Specifically, we first selected the most recent 500 frames of valid data acquired by the Kinect and designated the first frame as the Kinect’s starting frame. Then, based on the timestamp, the Visualeyez data frame closest in time to the Kinect’s starting frame was identified. Next, within a range of 60 frames before and after this identified Visualeyez frame, the frame exhibiting the closest human pose to the starting data was selected and designated as the Visualeyez starting frame. Through this process, the starting points of the two data sequences were aligned. Subsequently, all data frames were matched based on their respective actual time intervals. The method used to assess the similarity of human posture will be described in the following section.
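As a concrete illustration of the offline alignment described in the last item above, the following sketch assumes that both streams have already been exported as timestamp arrays plus arrays of the 16 selected joint positions in a common coordinate frame. A simple pelvis-centred mean joint distance stands in here for the skeletal cosine similarity of Section 2.5, purely to keep the sketch self-contained; it is an illustrative re-implementation, not the authors’ software.

```python
import numpy as np

def align_start_frames(kinect_t, kinect_poses, viz_t, viz_poses, search=60):
    """Offline synchronization sketch: find the Visualeyez frame that best
    matches the first of the retained Kinect frames.

    kinect_t     : (Nk,) Kinect timestamps in seconds
    kinect_poses : (Nk, 16, 3) joint positions in the common frame (mm)
    viz_t        : (Nv,) Visualeyez timestamps in seconds
    viz_poses    : (Nv, 16, 3) joint positions in the common frame (mm)
    search       : half-width of the refinement window in frames (60 here)
    """
    start_pose = kinect_poses[0]

    # Step 1: pick the Visualeyez frame closest in time to the Kinect start frame.
    nearest = int(np.argmin(np.abs(viz_t - kinect_t[0])))

    # Step 2: within +/- `search` frames, refine the choice by pose similarity.
    # A pelvis-centred mean joint distance is used as a simple pose metric
    # (node 5 of Table A1, zero-based index 4, is the pelvis).
    def pose_distance(pose_a, pose_b, pelvis=4):
        return np.mean(np.linalg.norm((pose_a - pose_a[pelvis])
                                      - (pose_b - pose_b[pelvis]), axis=1))

    lo = max(0, nearest - search)
    hi = min(len(viz_t), nearest + search + 1)
    best = min(range(lo, hi), key=lambda i: pose_distance(viz_poses[i], start_pose))
    return best  # index of the Visualeyez starting frame
```

Once the starting frames are aligned in this way, the remaining frames can be matched by stepping both timestamp arrays forward at their respective sampling intervals (30 Hz and 60 Hz).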

2.3.3. Visualization of the Human Skeleton

Based on the three-dimensional coordinates of the 16 selected joint nodes and their parent–child relationships, a three-dimensional model of the human skeleton can be rendered for explicit analysis. It is important to note that, given the selected set of joint nodes, the extremities of the rendered human model correspond to the wrist, ankle, and neck positions.
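A minimal rendering sketch in Python/matplotlib is given below. The parent–child connectivity listed here is an illustrative reconstruction from the selected node set and the segment names used in Section 2.5; it should be checked against Figure 2 rather than taken as the paper’s exact definition.

```python
import matplotlib.pyplot as plt

# Candidate parent-child pairs between the 16 selected nodes, using the 1-based
# node numbering of Appendix A Table A1. This connectivity is an illustrative
# reconstruction (shanks, thighs, hips, spine, scapulae, arms), not a verbatim
# copy of the paper's Figure 2.
BONES = [
    (1, 2), (2, 7), (3, 4), (4, 8),          # right/left shank and thigh
    (7, 5), (8, 5), (5, 6), (6, 15),         # hips and spine
    (15, 14), (13, 14), (16, 14),            # chest-neck and scapulae
    (12, 13), (11, 12), (10, 16), (9, 10),   # upper arms and forearms
]

def draw_skeleton(ax, joints_mm, **line_kwargs):
    """Draw one skeleton on a 3-D axis; `joints_mm` is a (16, 3) array in mm."""
    for a, b in BONES:
        p, q = joints_mm[a - 1], joints_mm[b - 1]
        ax.plot([p[0], q[0]], [p[1], q[1]], [p[2], q[2]], **line_kwargs)

# Typical usage: overlay the Kinect result (red), the Visualeyez ground truth
# (blue), and the corrected data (black), as in Figures 5 and 6.
# fig = plt.figure()
# ax = fig.add_subplot(projection="3d")
# draw_skeleton(ax, kinect_joints, color="r")
# draw_skeleton(ax, truth_joints, color="b")
# plt.show()
```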

2.4. Motion Capture Data Correction

The spatial positions of joint nodes acquired by Visualeyez were treated as the ground truth data. The discrepancies between these ground truth data and those acquired by Kinect were defined as deviations. This section introduces a correction method for the Kinect data, predicated on the availability of ground truth data. It is essential to emphasize that this study does not delve into optimal correction methodologies or data prediction techniques. Rather, it aims to investigate the potential for correcting the inherent deviations in Kinect data through post-processing.
The correction method is detailed as follows: After transforming the Kinect data into the Visualeyez coordinate system using Equation (1), the transformed data are treated as the independent variables, and the ground truth data as the dependent variables. Fitting is performed separately for each corresponding coordinate axis component. The multivariate linear regression function in Origin 2022 software was employed to perform the fitting calculations and obtain the coefficient and intercept parameters.
$$\tilde{x}_k^i = -y_k^i - \Delta x, \qquad \tilde{y}_k^i = x_k^i - \Delta y, \qquad \tilde{z}_k^i = z_k^i - \Delta z \quad (1)$$
In Equation (1), $(x_k^i, y_k^i, z_k^i)$ represents the raw data in the Kinect coordinate system, $(\tilde{x}_k^i, \tilde{y}_k^i, \tilde{z}_k^i)$ denotes their transformed values in the Visualeyez coordinate system, the superscript $i$ is the joint node index, and $\Delta x, \Delta y, \Delta z$ are the offset values, which are obtained using Equation (2),
$$\Delta x = -y_k^5 - x_v^5, \qquad \Delta y = x_k^5 - y_v^5, \qquad \Delta z = z_k^5 - z_v^5 \quad (2)$$
where $(x_v^i, y_v^i, z_v^i)$ represents the raw data acquired by the Visualeyez, i.e., the ground truth data. The index $i = 5$ signifies that the data of the pelvis joint node were used for calculating the offset values. This transformation ensures that $(\tilde{x}_k^5, \tilde{y}_k^5, \tilde{z}_k^5)$ and $(x_v^5, y_v^5, z_v^5)$ have identical values, meaning that the corrected position of the pelvis node coincides with the reference position.
Multivariate linear regression treats each joint node as an independent model. Based on the calculated coefficients $(a_x^i, a_y^i, a_z^i)$ and the intercept parameters $(b_x^i, b_y^i, b_z^i)$, the relationship between the corrected data $(x_f^i, y_f^i, z_f^i)$ and the transformed data $(\tilde{x}_k^i, \tilde{y}_k^i, \tilde{z}_k^i)$ is further established, as shown in Equation (3).
$$x_f^i = a_x^i \tilde{x}_k^i + b_x^i, \qquad y_f^i = a_y^i \tilde{y}_k^i + b_y^i, \qquad z_f^i = a_z^i \tilde{z}_k^i + b_z^i \quad (3)$$
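The fitting itself was performed with the multivariate linear regression function of Origin 2022; the sketch below reproduces the same pipeline with NumPy’s least-squares polynomial fit as a stand-in, following the reconstruction of Equations (1)–(3) above (pelvis = node 5 of Table A1, zero-based index 4). It is an illustrative re-implementation, not the authors’ code.

```python
import numpy as np

def to_visualeyez_frame(kinect_xyz, truth_xyz, pelvis=4):
    """Equations (1)-(2): remap the Kinect axes onto the Visualeyez axes
    (Visualeyez x <- -Kinect y, Visualeyez y <- Kinect x, z unchanged) and shift
    the whole skeleton so that the pelvis node coincides with its ground-truth
    position. Both inputs have shape (16, 3), in millimetres."""
    remapped = np.column_stack(
        [-kinect_xyz[:, 1], kinect_xyz[:, 0], kinect_xyz[:, 2]])
    offset = remapped[pelvis] - truth_xyz[pelvis]   # (dx, dy, dz)
    return remapped - offset

def fit_axis_correction(transformed, truth):
    """Equation (3): an independent linear model per joint node and per axis.

    `transformed` and `truth` have shape (n_frames, 16, 3).
    Returns slope and intercept arrays, each of shape (16, 3)."""
    _, n_nodes, _ = transformed.shape
    a = np.empty((n_nodes, 3))
    b = np.empty((n_nodes, 3))
    for i in range(n_nodes):
        for axis in range(3):
            a[i, axis], b[i, axis] = np.polyfit(
                transformed[:, i, axis], truth[:, i, axis], deg=1)
    return a, b

def apply_correction(transformed, a, b):
    """Corrected positions x_f = a * x_tilde + b, broadcast over frames."""
    return a * transformed + b
```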

2.5. Evaluation Method for Motion Capture Data

For a consumer-grade motion capture system such as Kinect, directly evaluating the spatial positional accuracy of skeletal nodes offers limited reference value on its own. On the other hand, the spatial orientation of individual skeletal segments aligns more closely with human intuitive perception, and joint angles between adjacent segments are frequently employed in motion tracking tasks for humanoid characters [39,40]. Therefore, this paper utilizes the similarity of the spatial orientations of skeletal segments to quantify the similarity of skeletal images and evaluate the accuracy of motion capture data. Specifically, cosine similarity is adopted as the metric to measure the discrepancy between two vectors. A cosine value approaching 1 indicates a corresponding angle approaching 0°, signifying a higher degree of similarity in the spatial orientation of the two vectors. Furthermore, we introduce weighting factors for the skeletal segment cosine similarities to reflect the varying contributions of individual segment similarities to the overall skeletal similarity.
First, the matrices $(\tilde{X}_k, \tilde{Y}_k, \tilde{Z}_k)$ and $(X_v, Y_v, Z_v)$, previously arranged according to joint index order, are transformed into the matrices $M_k$ and $M_v$, arranged according to skeletal segment order using Equation (4). Note that $(\tilde{X}_v, \tilde{Y}_v, \tilde{Z}_v)$ is equivalent to $(X_v, Y_v, Z_v)$. Subsequently, the cosine similarity $COS_j$ for each skeletal segment and the overall skeletal cosine similarity $COS_H$ are obtained using Equations (5) and (6), respectively, where the subscript $j$ denotes the skeletal segment index. The average cosine similarity of the upper limb segments (forearm, upper arm, and shoulder), $COS_{HU}$, and the average cosine similarity of the lower limb segments (shank, thigh, and hip), $COS_{HL}$, are calculated analogously to $COS_H$. Similarly, based on the corrected values $(X_f, Y_f, Z_f)$, the corrected cosine similarity for each segment, $COS_{fj}$, and the corrected overall skeletal cosine similarity, $COS_{fH}$, can be obtained.
$$M_{k,v} = \begin{bmatrix} \tilde{x}_{k,v}^{1} - \tilde{x}_{k,v}^{2} & \tilde{y}_{k,v}^{1} - \tilde{y}_{k,v}^{2} & \tilde{z}_{k,v}^{1} - \tilde{z}_{k,v}^{2} \\ \tilde{x}_{k,v}^{2} - \tilde{x}_{k,v}^{7} & \tilde{y}_{k,v}^{2} - \tilde{y}_{k,v}^{7} & \tilde{z}_{k,v}^{2} - \tilde{z}_{k,v}^{7} \\ \vdots & \vdots & \vdots \\ \tilde{x}_{k,v}^{16} - \tilde{x}_{k,v}^{14} & \tilde{y}_{k,v}^{16} - \tilde{y}_{k,v}^{14} & \tilde{z}_{k,v}^{16} - \tilde{z}_{k,v}^{14} \end{bmatrix} \quad (4)$$
in which row $j$ is the vector of skeletal segment $j$, formed by subtracting the coordinates of its parent joint node from those of joint node $j$, and the subscript $k,v$ stands for the Kinect and Visualeyez data, respectively.
$$COS_j = \frac{M_k^j \cdot M_v^j}{\lVert M_k^j \rVert \times \lVert M_v^j \rVert} \quad (5)$$
$$COS_H = \frac{\sum_{j=1}^{16} COS_j \, W_j}{\sum_{j=1}^{16} W_j} \quad (6)$$
where $W_j$ represents the weight assigned to skeletal segment $j$, with its values detailed in Table 3. Considering that the Kinect exhibits relatively higher recognition accuracy for the torso region and larger deviations at the extremities, such as the hands and feet, the weights of the terminal skeletal segments are set to the highest values. This weighting scheme aims to reflect the contribution of each skeletal segment to the overall recognition accuracy and to more prominently demonstrate the recognition and correction status of the limbs.
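A compact implementation of Equations (4)–(6) might look as follows. It reuses the illustrative `BONES` segment list from the sketch in Section 2.3.3, and the uniform weights shown in the usage comment are only a placeholder for the Table 3 values, which assign the highest weights to the terminal segments.

```python
import numpy as np

def segment_vectors(joints_mm, bones):
    """Equation (4): one 3-D vector per skeletal segment, formed by subtracting
    the parent-node coordinates from the child-node coordinates (1-based pairs)."""
    return np.stack([joints_mm[a - 1] - joints_mm[b - 1] for a, b in bones])

def segment_cosines(m_kinect, m_truth):
    """Equation (5): per-segment cosine similarity COS_j between the Kinect and
    ground-truth segment vectors (both of shape (n_segments, 3))."""
    num = np.sum(m_kinect * m_truth, axis=1)
    den = np.linalg.norm(m_kinect, axis=1) * np.linalg.norm(m_truth, axis=1)
    return num / den

def skeleton_cosine(cos_j, weights):
    """Equation (6): weighted overall skeletal cosine similarity COS_H."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(cos_j * weights) / np.sum(weights))

# Placeholder usage with uniform weights (the study's actual weights are those
# of Table 3, with terminal segments weighted highest):
# cos_j = segment_cosines(segment_vectors(kinect_joints, BONES),
#                         segment_vectors(truth_joints, BONES))
# cos_h = skeleton_cosine(cos_j, weights=[1.0] * len(BONES))
```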

3. Results

3.1. Results of Static Tests

3.1.1. Cosine Similarity Data

The cosine similarity data obtained from static tests are presented in Figure 3 and Figure 4, with the raw data provided in Appendix A Table A1, Table A2, Table A3 and Table A4.

3.1.2. Visualized Skeletal Data

(1) Distance variation
The visualized skeletal data obtained from static tests when the subject faced the Kinect are shown in Figure 5, with the raw data provided in Appendix A Table A1.
(2) Orientation variation
The visualized skeletal data obtained from static tests when the subject maintained an arms-outstretched pose and a specific orientation are shown in Figure 6. The raw data for the 2000 mm distance are provided in Appendix A Table A2.

3.2. Analysis of Static Test Results

3.2.1. Data Accuracy Analysis

(1) Influence of Distance
As shown in Figure 5, the recognized values $(\tilde{X}_k, \tilde{Y}_k, \tilde{Z}_k)$ from the Kinect exhibit three-dimensional spatial deviations from their ground truth $(X_v, Y_v, Z_v)$, with relatively larger deviations observed along the x-axis. However, for a variety of static poses, the skeletal morphology directly obtained by the Kinect (red) generally resembles the ground truth skeletal morphology (blue). Combined with the cosine similarity results in Figure 3, it can be inferred that within the distance range of 2000 mm to 3500 mm, the Kinect demonstrates good recognition accuracy. Both excessively close and excessively distant positions may lead to a decline in the accuracy of the Kinect’s skeletal recognition.
For example, in the test involving the arms swinging (left arm forward) pose, when the distance was less than 3300 mm, $COS_H$ remained at an excellent level of approximately 0.97. However, when the distance exceeded 3300 mm, it decreased to approximately 0.94. In the experiments with the arms-outstretched pose, $COS_H$ gradually increased from 0.95 at 1500 mm to 0.98 at 2000 mm and then stabilized around 0.98. This suggests that both excessively close and excessively distant positions relative to the Kinect camera can potentially reduce recognition accuracy.
(2) Influence of Orientation
The cosine similarity results in Figure 4 demonstrate that changes in orientation directly influence the $COS_H$ values. For instance, the curves of $COS_9$ (Left Forearm) and $COS_{11}$ (Right Forearm) exhibit a pattern of being higher in the middle (when the yaw angle is near 0°) and lower at the extremes (when the yaw angle deviates further from 0°). This indicates that as the yaw angle increasingly deviates from 0°, the accuracy of skeletal morphology recognition decreases, which is consistent with the skeletal morphology recognition results shown in Figure 6.
Figure 6 also clearly shows that within the range of ±[0°~15°], particularly at 0°, the recognition of the human skeleton is good. Within the range of ±[15°~30°], the recognition of most joints is good. However, as the distance increases, some joints exhibit larger recognition errors. For example, at a distance of 4000 mm and an orientation of 30°, the position of one of the wrists exhibited a significant unexpected deviation. As the yaw angle further increases, this phenomenon becomes more prevalent. For instance, at an orientation of ±45°, significant deviations in the position of one of the wrists appear across all distances. This significant deviation in joint position recognition can also affect the representation of skeletal joint angles. For example, at an orientation of +45°, the Kinect’s recognition of the spatial position of the left wrist node showed a large error, which also led to a deviation in the left elbow joint angle.
This phenomenon of significant recognition deviations at large orientation angles occurs more frequently at distances exceeding 3000 mm. However, even within the 2000 mm to 3000 mm range, which is empirically considered to generally provide good recognition, a small number of experimental groups exhibited this issue.
(3) Influence of Skeletal Distribution
The $COS_j$ data reveal that the Kinect demonstrates good recognition accuracy for the human torso, while larger errors and a corresponding decrease in accuracy are more likely to occur in the limbs. Notably, the recognition accuracy of the Kinect for the upper limbs is significantly lower than that for the lower limbs, as clearly evidenced by the results presented in Figure 3 and Figure 4. For example, in the first two tests in Figure 3, $COS_{11}$ (Right Forearm) and $COS_{12}$ (Right Upper Arm), and in the latter two tests, $COS_9$ (Left Forearm) and $COS_{10}$ (Left Upper Arm), each exhibited poor performance and significantly impacted the overall cosine similarity.
The standard deviation data corresponding to the different joint values $(\tilde{X}_k, \tilde{Y}_k, \tilde{Z}_k)$ reveal the direct cause of the deviations that tend to occur at the terminal joints of the limbs. Figure 7 sequentially presents the standard deviation data for the spatial distances of two torso joint nodes (Spine_Chest, Neck), two lower limb joint nodes (Knee_Right, Ankle_Right), and three upper limb joint nodes on the same side as the lower limbs (Shoulder_Right, Elbow_Right, and Wrist_Right). In the results shown in Figure 7 for a yaw angle of 30°, the standard deviations of all data are very small. However, as the yaw angle increases to 45°, the fluctuation in the standard deviation of the spatial position data of limb joints, especially the terminal joints, increases significantly, leading to a corresponding decrease in the recognition accuracy of these joints. The significant bending of the terminal bones of the arm in the skeletal morphology shown in Figure 6 can be attributed to this phenomenon. Furthermore, this high standard deviation also implies that instantaneous results for the terminal joints of the limbs should not be overly relied upon when the yaw angle is large, even if they appear to be relatively accurate at that moment.
Moreover, the joint position standard deviation data shown in Figure 7 further confirm that the Kinect’s recognition accuracy for the torso remains consistently high. This is because, under various distance and orientation configurations, the standard deviation of the position data of torso joints remains consistently low.

3.2.2. Effectiveness of Data Correction

The effectiveness of the correction applied to the Kinect data is clearly observable from the graphical representation of the skeletal data. In Figure 5, the corrected human skeletal data, represented by black lines, although still retaining the characteristic bending at the arms, are significantly closer to the ground truth morphology represented by blue lines than the original morphology represented by red lines. Moreover, the magnitude of this unexpected bending is also markedly reduced. Additionally, the forward-leaning deviation present in the original data is mitigated to a certain extent. Similar results are observed in Figure 6, where the initially pronounced arm deviations are substantially suppressed. Therefore, the Kinect recognition deviations, influenced by multiple factors such as distance and orientation, can be reduced to varying degrees through fitting-based correction.
The changes in cosine similarity before and after correction also corroborate this effect (detailed data are provided in Appendix A Table A3 and Table A4). The $COS_{fH}$ values after fitting correction show an improvement rate ranging from 26% to 74% compared to the $COS_H$ values before correction, indicating an effective enhancement in recognition accuracy across all test groups. However, it should be noted that with the multivariate linear regression method employed in this study, certain characteristics of the original data are still partially retained, meaning that deviations cannot be completely eliminated.

3.3. Results of Dynamic Tests

The cosine similarity results obtained from the various dynamic tests are presented in Figure 8, while the graphical skeletal data are provided in video format in the Appendix A Materials. The marching in place and forward and backward walking tests were performed by a 24-year-old male subject with a height of 1.85 m and a weight of 70 kg. The lateral walking and in-place body rotation tests were performed by another 25-year-old male subject with a height of 1.78 m and a weight of 71 kg. These two human subjects had comparable heights and physiques to the aforementioned static mannequin and wore clothing similar to that of the mannequin, thereby facilitating the correlation and comparative analyses between the dynamic and static test results.

3.4. Analysis of Dynamic Tests

(1) Influence of Distance in the Depth Direction
The results of the forward and backward walking test provide the most direct evidence of the correlation between distance in the depth direction and recognition accuracy. In Figure 8a, the data frame index on the horizontal axis corresponds to the complete process of the subject walking forward from a distance of 3500 mm to 1500 mm and then immediately walking backward to 3500 mm. When the subject was within the intermediate distance range of [2000 mm~3500 mm], the recognition results are stable and good ($COS_H > 0.9$). However, when the subject was too close or too far, the recognition results exhibit significant fluctuations and the accuracy decreases, although $COS_H$ remains above 0.75. The trend of recognition accuracy with respect to distance in the depth direction and the stable range is consistent with the results of the static tests, although the accuracy is slightly lower than that of the static standing tests.
The marching in place test further demonstrates the sensitivity of recognition accuracy to distance variations in the depth direction under dynamic conditions. In the results of the marching in place test at a distance of 2500 mm (the first 300 frames) shown in Figure 8b, as the arms swing back and forth, the accuracy variations of the arm on the same side are consistent. For example, the variations in the left forearm and the left upper arm are synchronized, and the accuracy of the forearm is significantly lower than that of the upper arm ($COS_9 < COS_{10}$), which is consistent with the difference in the range of motion in the depth direction between these two skeletal segments. Moreover, the variation patterns of the two arms alternate. However, it should be noted that when the depth distance is reduced to 2000 mm, the recognition accuracy of the shoulder joint begins to decrease as the shoulder data approaches the upper edge of the image. At this point, the recognition deviation of the shoulder bones (Left Scapula, Right Scapula) becomes significantly higher than that of the upper arm and close to that of the forearm. This is also the reason why the recognition accuracy of the hip bones (Left Hip, Right Hip), which are closer to the center of the image, does not change significantly as the distance changes from 2500 mm to 2000 mm.
(2) Influence of Lateral Distance
The trend of recognition accuracy with respect to lateral distance also reveals the existence of an optimal range of [−0.5 m~0.5 m], corresponding to the position directly in front of the Kinect lens. As the subject moves laterally away from the lens, the recognition accuracy decreases. In the lateral walking experiment shown in Figure 8c, the subject initially faced the Kinect lens. The test began with the subject taking a sidestep to the left boundary, then walking to the right, passing the initial position, and continuing to the right boundary, and finally walking back to the left to the initial position. Taking the test at a z-direction distance of 2500 mm as an example, during this process, the recognition performance in the central region [−0.5 m~0.5 m] was excellent ($COS_H > 0.95$), while at the two extreme sides, the recognition accuracy decreased significantly, although $COS_H$ remained above 0.74.
The lateral walking test further validates the influence of image edge factors on recognition accuracy. At distances of 2500 mm and 3000 mm, since the ankle joint is closer to the bottom of the image than the knee joint, the recognition accuracy of the lower leg bones is generally lower than that of the upper leg bones (consistent with the results of the static tests), i.e., $COS_1 < COS_2$ and $COS_3 < COS_4$. However, when the distance is further reduced to 2000 mm, as the subject moves laterally, both the ankle and knee joints approach the side edges of the image, causing the recognition deviation of the upper leg bones to become similar to, and in some areas even exceed, that of the lower leg bones, i.e., $COS_1 \gtrsim COS_2$ and $COS_3 \gtrsim COS_4$.
(3) Influence of Occlusion
Occlusion is another important factor that directly affects recognition accuracy, as demonstrated in the in-place body rotation test. When the human body is oriented sideways to the Kinect lens, some body regions are occluded by the body itself, leading to recognition deviations. In the in-place body rotation experiment shown in Figure 8d, the subject initially faced the Kinect lens, then rotated to the left and returned to the initial position, and then rotated to the right and returned to the initial position. The rotation angle was approximately [−15°~15°]. Furthermore, unlike the static tests, the upper limbs remained naturally hanging down during the rotation. During this process, the overall trend of $COS_H$ was similar to that of the static tests, with the highest recognition accuracy when facing the Kinect directly ($COS_H > 0.95$) and the lowest at the maximum rotation angle ($COS_H \approx 0.76$). During rotation, the hip and shoulder regions are most susceptible to occlusion due to their minimal corresponding body area and their location on the opposite side of the body. This is directly reflected in the lowest recognition accuracy of the hip and shoulder bones, with the accuracy directly correlating with the rotation angle.
In the aforementioned process, although the subject was positioned directly in front of the Kinect lens and was not near the lateral edges of the image, the phenomenon of the recognition deviation of the upper leg bones exceeding that of the lower leg bones, as observed in the lateral walking test, occurred again. We believe that as the rotation angle increases, not only is the hip joint most affected by occlusion, but partial occlusion of the upper leg region also leads to deviations in the Kinect’s recognition of the knee joint position. These factors directly affect the recognition accuracy of the upper leg bones. During rotation, the variation in recognition accuracy of the upper leg bones showed a certain consistency with that of the hip bones, such as the right hip bone (Right Hip) and the right upper leg bone (Right Thigh), which is consistent with our analysis.
(4) Effectiveness of Correction
Observing $COS_H$ and $COS_{fH}$ in Figure 8, it can be seen that after correcting the dynamic test data, the overall cosine similarity results improved by about 2% (calculation data are provided in Appendix A Table A5), indicating a very limited improvement. This is partly because, in the dynamic test results, $COS_H$ itself maintained a relatively high level in most cases, leaving limited room for improvement. On the other hand, in regions with poor test results, the data fluctuate more drastically, which significantly limits the effectiveness of the linear fitting correction method employed in this study.

4. Discussion

(1) Recognition Accuracy
At the outset of this study, we were intrigued by the question of whether a body in absolute rest or a body in continuous motion corresponds to higher recognition accuracy for the Kinect. By comparing the results of the in-place body rotation test in dynamic tasks (Figure 8d) and the orientation test in static tasks (Figure 4, 2000 mm), it can be observed that the $COS_{HL}$ value for the lower limbs in the static condition remains above 0.95 within the [−45°~45°] range. In contrast, even within the [−15°~15°] range in the dynamic condition, the $COS_{HL}$ partially drops to 0.9. Therefore, it is evident that the Kinect camera exhibits higher recognition accuracy for a human skeleton in a static state. It is important to note that this comparison is based on the $COS_{HL}$ value, representing the comprehensive cosine similarity of the six skeletal segments of the lower limbs, rather than the whole-body cosine similarity $COS_H$. This is because the upper limb postures in the two test tasks were not consistent.
Furthermore, regardless of whether the subject’s body is in a static or dynamic state, factors such as the distance and orientation relative to the Kinect camera, as well as self-occlusion, have a direct impact on recognition accuracy. The regions of high $COS_H$ values, which were experimentally determined, are recommended for preferential use. Specifically, these include maintaining a distance in the depth direction within the range of [2000 mm~3000 mm], keeping the lateral distance as close as possible to [−0.5 m~0.5 m], and maintaining an orientation directly facing the Kinect or within the range of [−15°~15°]. Under these conditions, the $COS_H$ value can typically be maintained above 0.9.
The recognition accuracy of bones located at the extremities of the limbs is generally lower than that of their parent bones. However, under certain circumstances, the opposite may occur, with the recognition accuracy of the parent bones significantly decreasing. These circumstances include the occurrence of visual occlusion (e.g., during rotation), excessively close depth distances (shoulder approaching the upper edge of the image), and excessively large lateral displacement (limbs approaching the side edges of the image).
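As a practical aid, the recommended operating region identified above can be encoded as a simple guard. The thresholds below come directly from the experimental findings in this section; the function itself is only an illustrative helper, not part of the study’s software.

```python
def in_recommended_region(depth_mm, lateral_mm, yaw_deg):
    """Return True when the subject lies in the region where the experiments
    showed COS_H typically stays above 0.9: a depth of 2000-3000 mm, a lateral
    offset within +/-500 mm of the optical axis, and a body yaw within +/-15 deg."""
    return (2000.0 <= depth_mm <= 3000.0
            and abs(lateral_mm) <= 500.0
            and abs(yaw_deg) <= 15.0)
```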
(2) Limitations of the Tests
It is worth noting that this study did not employ a large number of human subjects, and the tested movements were relatively conventional. This was constrained by the diversity and uncertainty of the issues discussed in this paper. Subjects with varying individual characteristics exhibit personalized manifestations in terms of body morphology, movement behaviors, and preferred movement patterns. This paper primarily focuses on analyzing the impact of active, usage-related factors on the accuracy of Kinect’s built-in motion capture for a given adult user. These factors include relative distance and orientation to the device, occlusion, and typical movement behaviors. Factors associated with passive individual physical characteristics were not systematically tested or analyzed in this study. Consequently, we intentionally selected a static mannequin and human subjects with similar physiques and statures, which facilitated the analysis of the aforementioned active factors across the extensive datasets. Admittedly, the relatively small sample size of subjects may introduce certain limitations to the generalizability of our findings. Nevertheless, we conducted some additional exploratory tests, and preliminary results indicate that the findings and discussions regarding the active factors demonstrate good robustness, provided that height and physique do not undergo substantial variations.
Furthermore, the tests were conducted under essentially constant indoor lighting conditions, primarily illuminated by artificial light sources. While this condition is similar to typical scenarios with ample natural light and aligns with the common usage of the device, it is acknowledged that conclusions drawn from this single condition may have potential limitations. Although we have considerable confidence in the robustness of the aforementioned conclusions regarding accuracy, variations in lighting and scene conditions could potentially influence the overall findings.
In summary, this paper focuses on investigating the overall performance of Azure Kinect DK in terms of motion recognition accuracy. In the future, with more specific and refined test objectives, a larger number and variety of subjects, as well as more diverse movements, will be considered. Specifically, further evaluation is needed regarding Kinect’s performance in capturing more complex or faster movements.
(3) Data Correction
After applying linear regression correction to the spatial position data of skeletal nodes obtained by the Kinect, the $COS_{fH}$ in static tests showed a significant improvement compared to $COS_H$, while the improvement in dynamic tests was very limited. This is mainly attributed to two reasons. First, the static tests covered a wider range of distances and orientations, resulting in a more pronounced decrease in recognition accuracy at the respective boundary regions. This provided more potential for accuracy optimization. In contrast, the dynamic tests had a smaller test range, and the recognition accuracy remained relatively high throughout. Second, in dynamic tests, the continuously reciprocating ground truth data limited the effectiveness of the linear fitting method.
Although the data correction method employed in this study demonstrated overall effectiveness, its rapid deployment and application are clearly restricted. This is due to factors such as the need for additional equipment and steps, as well as inter-individual variability in correction parameters, all of which are contrary to the “plug-and-play” and “anytime, anywhere” advantages of the Kinect device. Nevertheless, we believe that this fundamental correction method and its preliminary results still hold value for informing the future development of more intelligent and convenient correction methods. It should be reiterated that the primary purpose of data correction in this study was to assess whether the accuracy of raw data obtained from the Kinect motion capture device can be significantly improved when precise ground truth data are available. The results of this study provide an affirmative answer to this question. Moreover, in future in-depth studies on the dynamic performance of Kinect, advanced methods for improvement will be concurrently considered.
(4) Implications of the Results for Kinect Application Research
The latest Azure Kinect DK device boasts advancements in camera resolution and depth data accuracy compared to its predecessors. However, it remains a consumer-grade, cost-effective motion capture device. When deployed extensively and independently in real-world applications such as medical treatment, rehabilitation, and real-time robot control, its inherent limitations in human skeletal recognition accuracy cannot be overlooked.
The results presented in this paper realistically reflect these limitations. For example, a subject maintaining a walking motion while moving backward from 2000 mm to 3500 mm in the depth direction relative to the Kinect can experience an approximately 17% difference in recognition accuracy. This should be taken into account in applications that demand precise motion capture.
To pursue high-precision Kinect motion capture, this paper proposes recommended usage regions based on experimental results, providing actionable guidelines for researchers and developers utilizing this device. For instance, while Microsoft, the manufacturer of the Kinect device, suggests an optimal depth range of [0.5 m~3.86 m], this study further refines this range to [2 m~3.5 m] to acquire the most accurate representation of the complete human skeleton.
Furthermore, in application tasks where high-precision reference values can be obtained, such as in specific large-area human motion capture scenarios involving a high-precision professional motion capture system and multiple Kinect devices, the skeletal data directly acquired by the Azure Kinect DK can be further corrected based on its pre-acquired reference values to achieve higher accuracy. Even if the raw Kinect data are difficult to correct directly, their accuracy performance and potential for improvement can still be assessed.
Finally, it should be noted that the analysis of whole-body cosine similarity in this paper introduced weighting factors for different skeletal segments. However, the specific setting of these weights may depend on the researcher’s specific task and interests. Therefore, the raw data of unweighted joint nodes from multiple tests are also provided for researchers to utilize.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s25041047/s1, Video S1: Accuracy performance of Azure Kinect DK in multiple dynamic tests.

Author Contributions

Conceptualization, W.J.; methodology, H.W.; software, H.W. and T.B.; validation, H.W., W.J. and Q.C.; formal analysis, H.W. and W.J.; investigation, H.W. and W.J.; resources, W.J. and Y.S.; data curation, H.W.; writing—original draft preparation, W.J. and H.W.; writing—review and editing, Q.C.; visualization, H.W., Q.C. and W.J.; supervision, W.J.; project administration, Y.S.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2020YFB1313003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Acknowledgments

Thanks to He Dong, Ding Zihan, and Shen Bufan for their suggestions on the improvement of the testing process.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1. Spatial positioning data of body nodes for the mannequin maintaining an arms-outstretched pose and facing the Kinect camera directly in static tests.
Columns 4–6 give the Kinect measurements $(\tilde{x}_k^i, \tilde{y}_k^i, \tilde{z}_k^i)$, columns 7–9 the Visualeyez (ground truth) measurements $(x_v^i, y_v^i, z_v^i)$, and columns 10–12 the corrected Kinect values $(x_f^i, y_f^i, z_f^i)$; all values are in mm.

| Distance (mm) | $i$ | Node Name | $\tilde{x}_k^i$ | $\tilde{y}_k^i$ | $\tilde{z}_k^i$ | $x_v^i$ | $y_v^i$ | $z_v^i$ | $x_f^i$ | $y_f^i$ | $z_f^i$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1500 | 1 | Ankle_Right | −1181.91 | −79.75 | 1565.90 | −945.49 | 9.29 | 1352.81 | −966.19 | 8.06 | 1490.07 |
| | 2 | Knee_Right | −762.39 | −76.92 | 1351.41 | −645.44 | −14.54 | 1305.08 | −663.76 | −20.16 | 1398.85 |
| | 3 | Ankle_Left | −1171.43 | 143.46 | 1577.87 | −947.15 | 189.16 | 1400.66 | −970.04 | 194.94 | 1506.22 |
| | 4 | Knee_Left | −747.49 | 154.39 | 1384.04 | −643.39 | 197.34 | 1313.86 | −660.91 | 203.10 | 1393.28 |
| | 5 | Pelvis | −272.35 | 17.07 | 1302.66 | −237.66 | 85.70 | 1243.17 | −248.41 | 84.80 | 1274.80 |
| | 6 | Spine_Naval | −64.36 | 11.12 | 1239.58 | 77.10 | 55.12 | 1187.34 | 70.21 | 52.67 | 1209.00 |
| | 7 | Hip_Right | −278.87 | −82.18 | 1290.54 | −235.36 | 1.53 | 1238.37 | −246.64 | −4.78 | 1273.75 |
| | 8 | Hip_Left | −265.12 | 127.14 | 1316.11 | −221.42 | 159.01 | 1260.53 | −232.98 | 163.02 | 1289.47 |
| | 9 | Wrist_Left | 217.50 | 753.55 | 1218.26 | 275.09 | 759.95 | 1311.03 | 263.01 | 757.06 | 1297.46 |
| | 10 | Elbow_Left | 276.75 | 506.28 | 1341.67 | 301.80 | 507.06 | 1285.75 | 306.25 | 513.11 | 1377.20 |
| | 11 | Wrist_Right | 136.45 | −690.55 | 1082.30 | 172.24 | −632.04 | 1212.12 | 180.39 | −628.13 | 1208.42 |
| | 12 | Elbow_Right | 245.52 | −507.94 | 1273.79 | 249.26 | −382.55 | 1222.50 | 258.53 | −403.27 | 1326.40 |
| | 13 | Shoulder_Right | 284.20 | −189.53 | 1152.06 | 357.67 | −103.52 | 1232.79 | 358.65 | −112.37 | 1233.73 |
| | 14 | Neck | 367.47 | 5.60 | 1168.39 | 408.86 | 47.26 | 1227.52 | 413.65 | 48.41 | 1244.67 |
| | 15 | Spine_Chest | 104.77 | 4.85 | 1200.00 | 199.85 | 49.11 | 1173.96 | 197.29 | 45.50 | 1190.52 |
| | 16 | Shoulder_Left | 284.85 | 215.78 | 1171.49 | 358.02 | 215.53 | 1237.51 | 358.60 | 226.91 | 1243.46 |
| 2000 | 1 | Ankle_Right | −1016.73 | −125.23 | 1971.81 | −920.77 | −5.35 | 1873.28 | −915.52 | −14.01 | 1901.04 |
| | 2 | Knee_Right | −622.24 | −105.71 | 1795.11 | −619.80 | −29.72 | 1825.31 | −616.54 | −33.76 | 1842.45 |
| | 3 | Ankle_Left | −1008.72 | 56.36 | 1988.34 | −922.10 | 175.26 | 1918.26 | −917.52 | 160.04 | 1927.88 |
| | 4 | Knee_Left | −614.77 | 88.59 | 1823.68 | −588.03 | 182.16 | 1832.79 | −585.62 | 173.24 | 1835.99 |
| | 5 | Pelvis | −171.81 | −9.58 | 1780.61 | −211.96 | 69.96 | 1762.81 | −209.90 | 68.89 | 1759.34 |
| | 6 | Spine_Naval | 21.34 | −9.79 | 1729.99 | 102.54 | 37.97 | 1707.32 | 104.79 | 39.16 | 1706.74 |
| | 7 | Hip_Right | −175.64 | −100.41 | 1766.37 | −209.87 | −14.81 | 1759.38 | −207.30 | −16.10 | 1755.45 |
| | 8 | Hip_Left | −167.56 | 91.15 | 1796.40 | −196.20 | 143.03 | 1778.89 | −194.73 | 142.01 | 1776.99 |
| | 9 | Wrist_Left | 263.87 | 726.84 | 1725.54 | 297.23 | 742.99 | 1823.48 | 288.25 | 742.88 | 1812.57 |
| | 10 | Elbow_Left | 321.61 | 474.87 | 1748.98 | 327.25 | 491.17 | 1800.85 | 326.41 | 494.07 | 1795.62 |
| | 11 | Wrist_Right | 139.72 | −711.43 | 1628.11 | 197.53 | −648.90 | 1740.66 | 181.54 | −641.17 | 1755.08 |
| | 12 | Elbow_Right | 298.69 | −508.78 | 1680.13 | 274.45 | −399.28 | 1748.29 | 282.42 | −403.92 | 1744.10 |
| | 13 | Shoulder_Right | 342.82 | −197.11 | 1665.62 | 382.95 | −120.35 | 1755.42 | 381.05 | −117.76 | 1754.38 |
| | 14 | Neck | 418.65 | −17.57 | 1666.19 | 434.28 | 30.41 | 1748.57 | 437.31 | 32.63 | 1751.23 |
| | 15 | Spine_Chest | 178.09 | −11.68 | 1699.94 | 225.62 | 31.75 | 1694.56 | 228.14 | 34.32 | 1698.10 |
| | 16 | Shoulder_Left | 339.89 | 175.64 | 1672.99 | 383.64 | 198.84 | 1755.97 | 381.10 | 199.85 | 1751.66 |
| 2500 | 1 | Ankle_Right | −1060.83 | −152.40 | 2542.82 | −896.70 | −25.53 | 2397.84 | −919.05 | −27.20 | 2479.15 |
| | 2 | Knee_Right | −640.00 | −142.17 | 2343.19 | −594.88 | −49.29 | 2351.18 | −622.53 | −50.99 | 2390.42 |
| | 3 | Ankle_Left | −1045.13 | 26.53 | 2584.71 | −896.53 | 154.42 | 2448.32 | −919.27 | 148.08 | 2540.50 |
| | 4 | Knee_Left | −631.37 | 61.69 | 2384.41 | −593.07 | 162.40 | 2365.70 | −621.28 | 161.03 | 2400.63 |
| | 5 | Pelvis | −155.17 | −42.60 | 2326.97 | −186.89 | 50.98 | 2292.21 | −203.52 | 49.17 | 2313.24 |
| | 6 | Spine_Naval | 50.90 | −43.01 | 2265.70 | 128.16 | 18.96 | 2237.43 | 116.72 | 17.69 | 2250.44 |
| | 7 | Hip_Right | −158.93 | −140.87 | 2314.97 | −184.96 | −33.98 | 2287.68 | −200.93 | −41.23 | 2310.82 |
| | 8 | Hip_Left | −150.99 | 66.39 | 2340.28 | −170.90 | 123.50 | 2310.15 | −188.23 | 127.56 | 2329.06 |
| | 9 | Wrist_Left | 331.05 | 678.31 | 2242.97 | 337.60 | 720.33 | 2368.43 | 324.83 | 717.11 | 2337.98 |
| | 10 | Elbow_Left | 386.29 | 441.51 | 2380.56 | 353.28 | 469.49 | 2340.65 | 355.47 | 473.84 | 2444.42 |
| | 11 | Wrist_Right | 247.05 | −747.45 | 2113.13 | 219.94 | −670.37 | 2262.48 | 219.23 | −663.67 | 2240.85 |
| | 12 | Elbow_Right | 350.42 | −560.28 | 2298.51 | 300.10 | −419.90 | 2273.81 | 305.66 | −443.84 | 2379.75 |
| | 13 | Shoulder_Right | 397.92 | −243.96 | 2185.47 | 407.49 | −141.17 | 2285.11 | 402.10 | −151.02 | 2281.42 |
| | 14 | Neck | 478.36 | −51.11 | 2199.62 | 460.13 | 9.30 | 2281.55 | 464.91 | 9.77 | 2294.06 |
| | 15 | Spine_Chest | 218.20 | −44.62 | 2226.38 | 251.19 | 12.35 | 2226.67 | 245.02 | 12.04 | 2232.58 |
| | 16 | Shoulder_Left | 404.24 | 159.23 | 2204.88 | 409.50 | 178.08 | 2291.34 | 407.40 | 188.78 | 2290.65 |
| 3000 | 1 | Ankle_Right | −852.08 | −175.45 | 2945.98 | −863.94 | −50.71 | 2905.30 | −855.01 | −38.38 | 2887.33 |
| | 2 | Knee_Right | −469.95 | −171.32 | 2800.78 | −573.06 | −74.61 | 2859.93 | −565.23 | −64.76 | 2847.92 |
| | 3 | Ankle_Left | −847.49 | 18.44 | 2967.13 | −874.69 | 128.28 | 2958.82 | −865.47 | 144.84 | 2933.34 |
| | 4 | Knee_Left | −460.19 | 24.55 | 2850.29 | −541.29 | 136.12 | 2875.86 | −532.86 | 144.18 | 2869.76 |
| | 5 | Pelvis | −42.88 | −69.37 | 2806.34 | −165.84 | 25.53 | 2803.62 | −160.51 | 33.18 | 2799.22 |
| | 6 | Spine_Naval | 137.46 | −70.52 | 2750.90 | 147.82 | −6.32 | 2750.15 | 151.66 | −0.08 | 2742.88 |
7Hip_Right−47.41−155.232793.38−163.57−59.572798.66−158.42−50.152795.13
8Hip_Left−37.8625.842820.70−150.3098.172822.96−143.88103.892816.72
9Wrist_Left390.15630.442756.54359.58696.722878.24357.00691.692859.47
10Elbow_Left426.71387.992762.76374.98444.692852.62373.64441.402837.05
11Wrist_Right313.27−776.632623.61242.92−695.692775.43242.49−681.902752.12
12Elbow_Right379.16−544.152681.01320.50−445.112786.35318.57−431.342772.94
13Shoulder_Right439.92−252.982695.11353.53−165.632795.52418.14−157.432798.10
14Neck511.48−84.142684.57482.28−15.802793.51480.22−12.732787.55
15Spine_Chest284.53−73.242717.53271.50−12.782738.84272.92−7.322731.24
16Shoulder_Left458.51102.082711.38430.74149.872805.81429.59150.242803.92
35001Ankle_Right−843.76−188.583469.32−847.25−48.323422.10−852.46−44.753417.18
2Knee_Right−449.80−191.143328.33−555.82−73.823375.94−558.44−74.123375.36
3Ankle_Left−831.07−13.753492.16−856.42131.373474.51−860.17131.943472.68
4Knee_Left−437.040.813365.65−522.68136.173392.06−524.96133.403388.72
5Pelvis−11.58−88.513321.31−148.7123.033320.42−148.5121.763321.30
6Spine_Naval173.23−92.703265.50165.49−10.893266.01166.10−14.413265.16
7Hip_Right−17.71−176.223307.60−146.40−62.233314.63−147.10−63.183315.69
8Hip_Left−4.788.773336.52−131.4195.263337.56−130.9193.923340.30
9Wrist_Left462.71624.603279.20381.98690.613394.92396.50688.593390.20
10Elbow_Left487.60374.903274.94395.81437.693370.75401.00433.473363.19
11Wrist_Right366.77−829.633213.35255.05−700.463288.55261.28−715.023342.78
12Elbow_Right420.85−581.353206.38335.94−450.803306.32337.30−460.173312.98
13Shoulder_Right487.41−284.033200.72441.76−172.733314.69436.29−179.483310.70
14Neck555.48−110.103194.40499.63−22.983310.59500.56−30.423306.36
15Spine_Chest323.87−98.023232.18288.13−18.273255.94289.48−24.073253.74
16Shoulder_Left505.2181.123221.72449.92143.063322.03448.67136.113321.07
40001Ankle_Right−801.31−253.343978.51−825.34−81.383935.26−839.44−76.183932.71
2Knee_Right−406.94−248.463836.56−532.70−106.283889.84−543.99−101.203883.49
3Ankle_Left−788.13−90.614027.84−833.9795.883989.49−846.31101.134022.95
4Knee_Left−402.93−46.363880.45−501.10101.163911.69−513.32111.993907.12
5Pelvis27.92−141.493834.27−126.30−9.393836.04−133.38−9.893841.33
6Spine_Naval214.01−136.953781.66187.21−42.883781.51182.55−43.003789.03
7Hip_Right25.93−229.413819.62−123.78−95.233830.22−130.47−96.223834.03
8Hip_Left30.12−44.003850.51−109.2562.143856.45−117.2263.123862.03
9Wrist_Left470.53591.763800.17405.81656.283925.66400.75671.153919.20
10Elbow_Left519.99345.263795.94418.25402.753895.16415.55415.503898.40
11Wrist_Right341.20−814.033663.16276.58−734.503795.31252.30−705.273793.28
12Elbow_Right446.42−594.313700.88356.54−484.363812.61348.79−470.213821.30
13Shoulder_Right563.46−315.183711.85463.59−206.193825.51465.35−201.603828.89
14Neck598.04−136.933715.04521.04−56.243825.01520.23−48.703836.17
15Spine_Chest365.04−134.833748.48309.80−50.183769.54306.80−48.973777.93
16Shoulder_Left562.5553.833740.89481.37101.563838.78472.11117.713847.18
Table A2. Spatial positioning data of body nodes for the mannequin maintaining an arms-outstretched pose at a 2000 mm distance directly in front of the Kinect camera in static tests.
Orientation (°) | Node Index (i) | Node Name | Kinect Measurements x̃_k^i, ỹ_k^i, z̃_k^i (mm) | Visualeyez Measurements x_v^i, y_v^i, z_v^i (mm) | Corrected Kinect Values x_f^i, y_f^i, z_f^i (mm)
−45°1Ankle_Right−1002.8832.671944.30−912.39−49.861903.94−913.48−30.961915.92
2Knee_Right−611.03−31.961789.64−617.12−87.001869.38−619.78−80.451859.18
3Ankle_Left−1019.26123.631807.77−933.89107.491814.97−934.16114.501832.25
4Knee_Left−610.39112.461704.93−613.1776.261739.73−609.1581.331740.91
5Pelvis−171.0829.981717.98−231.22−35.491747.50−229.53−32.781757.14
6Spine_Naval17.854.951669.4193.81−85.341720.1195.79−87.211724.48
7Hip_Right−171.88−51.811757.03−227.17−101.161801.83−226.21−93.191807.13
8Hip_Left−170.19120.681674.68−219.6628.571707.60−216.7930.601722.37
9Wrist_Left197.28500.941114.13239.60474.761266.09247.25478.591239.42
10Elbow_Left264.65346.361306.31276.56286.761443.42284.31285.301430.12
11Wrist_Right244.73−421.351742.77239.95−514.382250.50252.38−489.632255.87
12Elbow_Right314.49−350.701981.76298.73−341.522079.83328.91−321.212197.32
13Shoulder_Right361.22−152.561747.47383.80−143.281857.09386.83−136.401870.65
14Neck409.08−27.551613.72424.83−43.351739.21431.27−45.701746.07
15Spine_Chest172.57−8.291644.84219.83−93.251712.80219.86−97.721716.92
16Shoulder_Left323.62106.881485.11370.5460.561632.37375.3553.351633.43
−30°1Ankle_Right−1027.16−6.462005.28−912.53−63.771902.22−912.58−48.331945.99
2Knee_Right−628.36−48.361844.79−616.33−97.551861.02−618.37−87.021863.72
3Ankle_Left−1035.24125.131843.72−932.03104.591837.98−935.44117.761849.36
4Knee_Left−623.73113.941726.69−611.6983.661758.95−613.7295.901760.02
5Pelvis−182.5321.481738.70−230.55−29.991749.61−231.45−16.891750.38
6Spine_Naval9.69−6.011692.6394.73−79.551716.9694.49−65.331715.91
7Hip_Right−188.09−65.681767.53−227.40−103.371793.43−228.21−89.901789.75
8Hip_Left−176.37118.141706.73−218.2339.541720.44−219.0753.821726.17
9Wrist_Left223.48574.361211.85244.81546.261355.94235.99569.391348.24
10Elbow_Left275.81383.911380.07280.29332.421500.62277.64352.661494.91
11Wrist_Right227.67−531.992026.58239.14−580.102177.82241.67−583.752174.6
12Elbow_Right314.40−426.561950.92299.25−384.632025.80298.35−377.302021.02
13Shoulder_Right356.32−182.461756.34384.42−159.901843.82386.61−147.361842.95
14Neck406.71−39.921638.10425.76−43.831743.00427.29−28.591743.18
15Spine_Chest166.07−24.461666.99221.24−85.021707.89221.79−72.081706.12
16Shoulder_Left327.91118.671530.04372.9379.531652.55374.7798.461651.75
−15°1Ankle_Right−942.66−81.311901.17−912.67−81.431889.45−896.32−62.121890.94
2Knee_Right−555.20−95.501787.27−616.96−106.271843.09−606.35−90.281829.14
3Ankle_Left−934.45115.291857.14−930.0498.131874.26−916.18119.701878.45
4Knee_Left−551.7293.391743.29−609.8995.191794.02−598.06111.901794.09
5Pelvis−136.249.081744.51−229.85−15.511754.82−222.573.191751.84
6Spine_Naval43.26−5.661696.1894.88−58.691711.41100.99−38.061707.85
7Hip_Right−138.47−75.061761.87−227.96−97.951777.76−220.05−78.651771.85
8Hip_Left−133.76102.391725.26−216.3759.281745.58−209.0578.551746.32
9Wrist_Left281.34640.751414.72255.48637.331528.04275.03651.591582.82
10Elbow_Left311.36418.121505.85287.28393.161611.74291.95413.501628.86
11Wrist_Right284.02−687.561889.89229.53−662.532023.94235.90−657.392019.26
12Elbow_Right318.46−454.311821.72294.19−435.841933.16277.73−427.491937.79
13Shoulder_Right348.34−184.931707.01383.02−172.121815.31385.52−154.981804.95
14Neck410.80−36.421623.69426.52−34.471748.00430.28−15.481748.50
15Spine_Chest188.19−16.371662.91221.63−62.531702.20226.70−43.221697.91
16Shoulder_Left343.78139.671575.41375.68108.791691.75379.76128.731701.92
1Ankle_Right−1016.73−125.231971.81−910.77−5.351873.28−905.52−14.011901.04
2Knee_Right−622.24−105.711795.11−619.80−29.721825.31−616.54−33.761842.45
3Ankle_Left−1008.7256.361988.34−922.10175.261918.26−917.52160.041927.88
4Knee_Left−614.7788.591823.68−588.03182.161832.79−585.62173.241835.99
5Pelvis−171.81−9.581780.61−211.9669.961762.81−209.9068.891759.34
6Spine_Naval21.34−9.791729.99102.5437.971707.32104.7939.161706.74
7Hip_Right−175.64−100.411766.37−209.87−14.811759.38−207.30−16.101755.45
8Hip_Left−167.5691.151796.40−196.20143.031778.89−194.73142.011776.99
9Wrist_Left263.87726.841725.54297.23742.991823.48288.25742.881812.57
10Elbow_Left321.61474.871748.98327.25491.171800.85326.41494.071795.62
11Wrist_Right139.72−711.431628.11197.53−648.901740.66181.54−641.171755.08
12Elbow_Right298.69−508.781680.13274.45−399.281748.29282.42−403.921744.10
13Shoulder_Right342.82−197.111665.62382.95−120.351755.42381.05−117.761754.38
14Neck418.65−17.571666.19434.2830.411748.57437.3132.631751.23
15Spine_Chest178.09−11.681699.94225.6231.751694.56228.1434.321698.10
16Shoulder_Left339.89175.641672.99383.64198.841755.97381.10199.851751.66
15°1Ankle_Right−984.67−121.621882.06−929.25−26.821825.63−927.29−34.471851.35
2Knee_Right−595.03−109.031707.27−634.96−41.661777.48−636.81−36.431774.68
3Ankle_Left−974.3360.961990.39−930.93142.591893.71−924.32146.161935.49
4Knee_Left−592.4783.011810.49−612.19157.371840.44−606.64157.571837.52
5Pelvis−153.99−5.121751.85−233.1578.981744.09−229.1985.161739.42
6Spine_Naval33.73−1.541691.7190.0743.081685.8994.7753.491685.95
7Hip_Right−154.85−95.621743.78−234.93−1.031722.72−231.394.011717.35
8Hip_Left−153.0495.241760.79−219.39149.731772.24−215.18156.451770.56
9Wrist_Left228.25722.381771.50240.26731.071880.25251.22739.531840.89
10Elbow_Left298.19476.901748.38278.60480.961831.46286.02490.251818.82
11Wrist_Right273.34−723.051474.17296.09−653.511632.49320.39−629.031625.09
12Elbow_Right316.36−480.411555.63310.44−409.421673.01350.78−378.841689.04
13Shoulder_Right358.32−179.001620.47378.35−116.201714.08377.62−103.381717.54
14Neck424.770.881623.38426.0644.121730.31432.4856.471735.69
15Spine_Chest186.900.311654.60215.6741.381672.70221.0252.581671.88
16Shoulder_Left357.27190.751659.61381.81194.421747.79387.72205.481752.88
30°1Ankle_Right−961.56−127.331843.05−930.58−13.371796.87−924.88−14.381828.54
2Knee_Right−584.75−90.111672.12−635.62−19.111747.83−636.64−16.271742.99
3Ankle_Left−943.1146.521997.76−930.74135.161904.42−920.25142.211944.43
4Knee_Left−574.9485.261818.96−611.52158.801857.32−605.24165.051856.77
5Pelvis−152.81−10.481724.53−232.25102.161748.15−230.40107.451750.23
6Spine_Naval33.21−2.091678.9891.1277.991684.9494.3379.621686.50
7Hip_Right−155.77−94.381697.00−234.6629.751708.09−234.7035.171709.94
8Hip_Left−149.5482.551755.07−217.69163.871792.63−213.86168.671796.54
9Wrist_Left245.88556.861635.89247.38697.022043.84258.79711.821912.97
10Elbow_Left286.85435.281849.49282.86465.681935.11288.07464.291934.20
11Wrist_Right241.91−654.731367.10290.47−587.021462.72316.28−565.361494.09
12Elbow_Right293.88−425.331458.58306.85−360.101562.81324.58−366.681579.94
13Shoulder_Right356.87−151.101569.83377.42−86.351675.26379.49−88.081675.85
14Neck414.2118.921619.28427.0664.931731.18430.8462.951736.03
15Spine_Chest183.302.681648.34216.7778.711672.56220.6481.011672.99
16Shoulder_Left355.78190.081696.86383.68206.971784.32390.44206.701791.37
45°1Ankle_Right−988.68−122.791823.42−931.728.071767.61−925.227.841774.56
2Knee_Right−604.83−72.161657.31−636.2411.191720.32−633.5917.251715.18
3Ankle_Left−965.666.291993.24−930.76125.951908.22−919.36109.261939.91
4Knee_Left−588.7665.841824.38−610.89157.011869.75−603.10161.931870.89
5Pelvis−164.75−11.891695.88−231.34124.341752.69−228.31129.731751.49
6Spine_Naval24.491.231652.3592.25112.531687.4595.75118.381683.65
7Hip_Right−169.19−89.721653.51−234.2063.861696.64−232.1767.971698.83
8Hip_Left−159.8274.421742.87−216.72173.341810.98−212.65179.411806.20
9Wrist_Left223.74467.211706.84252.34625.122188.58258.00628.872177.64
10Elbow_Left302.89368.261921.51286.44426.352026.10290.77430.412010.70
11Wrist_Right225.72−545.641161.43285.49−481.781310.37287.23−484.871321.89
12Elbow_Right291.07−379.301341.14303.53−285.621463.11303.46−261.941466.59
13Shoulder_Right343.78−133.401511.05377.11−47.771640.17376.33−36.741635.05
14Neck413.1314.641600.52427.5485.081731.03429.6093.461721.73
15Spine_Chest177.986.121627.35217.77115.141675.48221.16121.321671.19
16Shoulder_Left365.38170.771709.64385.60210.491817.69391.93216.101807.59
Table A3. Cosine similarities calculated from the data in Appendix A Table A1 (orientation 0°, subject facing the Kinect directly).
Distance (mm) | COS_H | COS_fH
1500 | 0.4283 | 0.7327
2000 | 0.7127 | 0.9849
2500 | 0.4168 | 0.6579
3000 | 0.7698 | 0.9214
3500 | 0.8127 | 0.9796
4000 | 0.7872 | 0.9841
Table A4. Cosine similarities calculated from the data in Appendix A Table A2. Each cell lists COS_H / COS_fH.
Orientation | 2000 mm | 2500 mm | 3000 mm | 3500 mm | 4000 mm
−45° | −0.1362 / 0.9853 | −0.0504 / 0.9845 | −0.0363 / 0.9728 | 0.5672 / 0.9507 | −0.0368 / 0.9596
−30° | 0.0724 / 0.8493 | 0.7436 / 0.9243 | 0.7660 / 0.9375 | 0.7504 / 0.9226 | 0.0286 / 0.7229
−15° | 0.7112 / 0.9653 | 0.7313 / 0.9527 | 0.7427 / 0.9732 | 0.7343 / 0.9569 | −0.0776 / 0.9753
0° | 0.7127 / 0.9849 | 0.4168 / 0.6579 | 0.7698 / 0.9214 | 0.8127 / 0.9796 | 0.7872 / 0.9841
15° | 0.7928 / 0.9726 | 0.8122 / 0.9888 | 0.7561 / 0.9526 | 0.6730 / 0.9290 | 0.5998 / 0.9351
30° | −0.1009 / 0.9770 | −0.0833 / 0.9667 | 0.1693 / 0.9267 | 0.6941 / 0.9531 | 0.7011 / 0.9866
45° | −0.2969 / 0.8619 | 0.5375 / 0.9203 | 0.2850 / 0.2859 | 0.1841 / 0.6947 | 0.0205 / −0.0486
Table A5. Average cosine similarities in dynamic tests.
Test Scenario | Distance (mm) | COS_H | COS_fH
Forward and backward walking | - | 0.9744 | 0.9781
Lateral walking | 2000 | 0.9003 | 0.9162
Lateral walking | 2500 | 0.9762 | 0.9960
Lateral walking | 3000 | 0.9820 | 0.9961
Marching in place | 2000 | 0.9770 | 0.9949
Marching in place | 2500 | 0.9814 | 0.9875
In-place body rotation | 2000 | 0.9625 | 0.9681

References

  1. Ren, Z.; Yuan, J.; Meng, J.; Zhang, Z. Robust Part-Based Hand Gesture Recognition Using Kinect Sensor. IEEE Trans. Multimed. 2013, 15, 1110–1120. [Google Scholar] [CrossRef]
  2. Du, G.; Zhang, P.; Mai, J.; Li, S. Markerless Kinect-Based Hand Tracking for Robot Teleoperation. Int. J. Adv. Robot. Syst. 2012, 9, 36–44. [Google Scholar] [CrossRef]
  3. Ramos, J.; Kim, S. Dynamic Locomotion Synchronization of Bipedal Robot and Human Operator via Bilateral Feedback Teleoperation. Sci. Robot. 2019, 4, eaav4282. [Google Scholar] [CrossRef] [PubMed]
  4. Kofman, J.; Wu, X.; Luu, T.J.; Verma, S. Teleoperation of a Robot Manipulator Using a Vision-Based Human-Robot Interface. IEEE Trans. Ind. Electron. 2005, 52, 1206–1219. [Google Scholar] [CrossRef]
  5. Verma, S. Vision-Based Markerless 3D Human-Arm Tracking. Master’s Thesis, University of Ottawa, Ottawa, ON, Canada, 2005. [Google Scholar]
  6. Tölgyessy, M.; Dekan, M.; Chovanec, Ľ.; Hubinský, P. Evaluation of the Azure Kinect and Its Comparison to Kinect v1 and Kinect v2. Sensors 2021, 21, 413. [Google Scholar] [CrossRef]
  7. Xing, Q.J.; Shen, Y.Y.; Cao, R.; Zong, S.X.; Zhao, S.X.; Shen, Y.F. Functional Movement Screen Dataset Collected with Two Azure Kinect Depth Sensors. Sci. Data 2022, 9, 104. [Google Scholar] [CrossRef] [PubMed]
  8. Microsoft. About Azure Kinect DK. Microsoft Learn. Available online: https://learn.microsoft.com/previous-versions/azure/kinect-dk/about-azure-kinect-dk (accessed on 25 December 2024).
  9. Amprimo, G.; Masi, G.; Priano, L.; Azzaro, C.; Galli, F.; Pettiti, G.; Mauro, A.; Ferraris, C. Assessment Tasks and Virtual Exergames for Remote Monitoring of Parkinson’s Disease: An Integrated Approach Based on Azure Kinect. Sensors 2022, 22, 8173. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, T.; Li, C.; Wu, C.; Zhao, C.; Sun, J.; Peng, H.; Hu, X.; Hu, B. A Gait Assessment Framework for Depression Detection Using Kinect Sensors. IEEE Sens. J. 2021, 21, 3260–3270. [Google Scholar] [CrossRef]
  11. Bertram, J.; Krüger, T.; Röhling, H.M.; Jelusic, A.; Mansow-Model, S.; Schniepp, R.; Wuehr, M.; Otte, K. Accuracy and Repeatability of the Microsoft Azure Kinect for Clinical Measurement of Motor Function. PLoS ONE 2023, 18, e0279697. [Google Scholar] [CrossRef] [PubMed]
  12. Bärligea, A.; Hase, K.; Yoshida, M. Simulation of Human Movement in Zero Gravity. Sensors 2024, 24, 1770. [Google Scholar] [CrossRef] [PubMed]
  13. Nascimento, H.; Mujica, M.; Benoussaad, M. Collision Avoidance in Human-Robot Interaction Using Kinect Vision System Combined with Robot’s Model and Data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10293–10298. [Google Scholar] [CrossRef]
  14. Zhang, J.; Li, P.; Zhu, T.; Zhang, W.A.; Liu, S. Human Motion Capture Based on Kinect and IMUs and Its Application to Human-Robot Collaboration. In Proceedings of the 2020 5th International Conference on Advanced Robotics and Mechatronics (ICARM), Shenzhen, China, 18–21 December 2020; pp. 392–397. [Google Scholar] [CrossRef]
  15. Ma, Y.; Liu, D.; Cai, L. Deep Learning-Based Upper Limb Functional Assessment Using a Single Kinect v2 Sensor. Sensors 2020, 20, 1903. [Google Scholar] [CrossRef]
  16. An, J.; Cheng, X.; Wang, Q.; Chen, H.; Li, J.; Li, S. Human Action Recognition Based on Kinect. J. Phys. Conf. Ser. 2020, 1693, 012190. [Google Scholar] [CrossRef]
  17. Park, B.-K.D.; Jung, H.; Ebert, S.M.; Corner, B.D.; Reed, M.P. Efficient Model-Based Anthropometry under Clothing Using Low-Cost Depth Sensors. Sensors 2024, 24, 1350. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, P.-L.; Chang, C.-C.; Li, L.; Xu, X. A Simple Method to Optimally Select Upper-Limb Joint Angle Trajectories from Two Kinect Sensors during the Twisting Task for Posture Analysis. Sensors 2022, 22, 7662. [Google Scholar] [CrossRef] [PubMed]
  19. Saputra, A.A.; Besari, A.R.A.; Kubota, N. Human Joint Skeleton Tracking Using Multiple Kinect Azure. In Proceedings of the 2022 International Electronics Symposium (IES), Surabaya, Indonesia, 9–11 August 2022. [Google Scholar] [CrossRef]
  20. Ryselis, K.; Petkus, T.; Blažauskas, T.; Maskeliūnas, R.; Damaševičius, R. Multiple Kinect Based System to Monitor and Analyze Key Performance Indicators of Physical Training. Hum.-Centric Comput. Inf. Sci. 2020, 10, 51. [Google Scholar] [CrossRef]
  21. Eichler, N.; Hel-Or, H.; Shimshoni, I. Spatio-Temporal Calibration of Multiple Kinect Cameras Using 3D Human Pose. Sensors 2022, 22, 8900. [Google Scholar] [CrossRef] [PubMed]
  22. Yeung, L.F.; Yang, Z.; Cheng, K.C.; Du, D.; Tong, R.K.Y. Effects of Camera Viewing Angles on Tracking Kinematic Gait Patterns Using Azure Kinect, Kinect v2 and Orbbec Astra Pro v2. Gait Posture 2021, 87, 19–26. [Google Scholar] [CrossRef]
  23. Bilesan, A.; Behzadipour, S.; Tsujita, T.; Komizunai, S.; Konno, A. Markerless Human Motion Tracking Using Microsoft Kinect SDK and Inverse Kinematics. In Proceedings of the 2019 12th Asian Control Conference (ASCC), Kitakyushu, Japan, 9–12 June 2019; pp. 504–509. [Google Scholar]
  24. Bilesan, A.; Komizunai, S.; Tsujita, T.; Konno, A. Improved 3D Human Motion Capture Using Kinect Skeleton and Depth Sensor. J. Robot. Mechatron. 2021, 33, 1408–1422. [Google Scholar] [CrossRef]
  25. Beshara, P.; Anderson, D.B.; Pelletier, M.; Walsh, W.R. The Reliability of the Microsoft Kinect and Ambulatory Sensor-Based Motion Tracking Devices to Measure Shoulder Range-of-Motion: A Systematic Review and Meta-Analysis. Sensors 2021, 21, 8186. [Google Scholar] [CrossRef]
  26. Kurillo, G.; Hemingway, E.; Cheng, M.L.; Cheng, L. Evaluating the Accuracy of the Azure Kinect and Kinect v2. Sensors 2022, 22, 2469. [Google Scholar] [CrossRef]
  27. Büker, L.; Quinten, V.; Hackbarth, M.; Hellmers, S.; Diekmann, R.; Hein, A. How the Processing Mode Influences Azure Kinect Body Tracking Results. Sensors 2023, 23, 878. [Google Scholar] [CrossRef] [PubMed]
  28. Jamali, Z.; Behzadipour, S. Quantitative Evaluation of Parameters Affecting the Accuracy of Microsoft Kinect in Gait Analysis. In Proceedings of the 23rd Iranian Conference on Biomedical Engineering (ICBME), Tehran, Iran, 24–25 November 2016; pp. 306–311. [Google Scholar] [CrossRef]
  29. Martiš, P.; Košutzká, Z.; Kranzl, A. A Step Forward Understanding Directional Limitations in Markerless Smartphone-Based Gait Analysis: A Pilot Study. Sensors 2024, 24, 3091. [Google Scholar] [CrossRef] [PubMed]
  30. Thomas, J.; Hall, J.B.; Bliss, R.; Guess, T.M. Comparison of Azure Kinect and Optical Retroreflective Motion Capture for Kinematic and Spatiotemporal Evaluation of the Sit-to-Stand Test. Gait Posture 2022, 94, 153–159. [Google Scholar] [CrossRef] [PubMed]
  31. Steinebach, T.; Grosse, E.H.; Glock, C.H.; Wakula, J.; Lunin, A. Accuracy Evaluation of Two Markerless Motion Capture Systems for Measurement of Upper Extremities: Kinect V2 and Captiv. Hum. Factors Ergon. Manuf. Serv. Ind. 2020, 30, 315–327. [Google Scholar] [CrossRef]
  32. Antico, M.; Balletti, N.; Laudato, G.; Lazich, A.; Notarantonio, M.; Oliveto, R.; Ricciardi, S.; Scalabrino, S.; Simeone, J. Postural Control Assessment via Microsoft Azure Kinect DK: An Evaluation Study. Comput. Methods Programs Biomed. 2021, 209, 106324. [Google Scholar] [CrossRef] [PubMed]
  33. Milosevic, B.; Leardini, A.; Farella, E. Kinect and Wearable Inertial Sensors for Motor Rehabilitation Programs at Home: State of the Art and an Experimental Comparison. BioMed. Eng. OnLine 2020, 19, 25. [Google Scholar] [CrossRef]
  34. Pfister, A.; West, A.M.; Bronner, S.; Noah, J.A. Comparative Abilities of Microsoft Kinect and Vicon 3D Motion Capture for Gait Analysis. J. Med. Eng. Technol. 2014, 38, 274–280. [Google Scholar] [CrossRef]
  35. Holland, M. Visual Puppeteering Using the Vizualeyez: 3D Motion Capture System. Master’s Thesis, University of Twente, Enschede, The Netherlands, 2018. [Google Scholar]
  36. Oztop, E.; Lin, L.-H.; Kawato, M.; Cheng, G. Extensive Human Training for Robot Skill Synthesis: Validation on a Robotic Hand. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, 10–14 April 2007; pp. 1788–1793. [Google Scholar] [CrossRef]
  37. PTI Phoenix Technologies Inc. Visualeyez III VZ10K/VZ10K5 Trackers. Available online: https://www.ptiphoenix.com/products/trackers/VZ10K_VZ10K5 (accessed on 25 December 2024).
  38. Sosa-León, V.A.L.; Schwering, A. Evaluating Automatic Body Orientation Detection for Indoor Location from Skeleton Tracking Data to Detect Socially Occupied Spaces Using the Kinect v2, Azure Kinect, and Zed 2i. Sensors 2022, 22, 3798. [Google Scholar] [CrossRef] [PubMed]
  39. Cao, Z.; Bao, T.; Jia, W.; Ma, S.; Yuan, J. Towards a More Practical Data-Driven Biped Walking Control. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; pp. 1058–1064. [Google Scholar] [CrossRef]
  40. Cao, Z.; Bao, T.; Ren, Z.; Fan, Y.; Deng, K.; Jia, W. Real-Time Stylized Humanoid Behavior Control through Interaction and Synchronization. Sensors 2022, 22, 1457. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration of the experimental protocol and scenarios. (a) Placement of the two types of motion capture devices, with the default coordinate system of Kinect indicated in red and that of Visualeyez in blue. Red nodes and lines superimposed on the human figure are directly generated from skeletal data acquired by Kinect. (b) Photograph of the testing environment and equipment. (c) Key test parameters for static human pose evaluation, including various pose types, distances, and orientations, all performed using a mannequin. (d) Various motion types used to evaluate motion capture performance under dynamic conditions, performed by human subjects.
Figure 2. Selection from the 32 human tracking joint nodes provided by Azure Kinect DK, with 16 nodes used to reconstruct the human skeletal morphology. For each selected node, the physical marker required for Visualeyez recognition was attached to the corresponding location on the body surface. Considering that this device necessitates grouping every four markers and connecting them to the receiver via a single cable, adjacent numbering was adopted to minimize the constraints imposed by cables on the subject’s movement.
Figure 3. Variation in cosine similarity with distance in static tests when the subject faced the Kinect. The figure includes COS_j, COS_H, COS_fH, COS_H^U, and COS_H^L data. COS_j data are visualized separately for the upper and lower limbs to correspond to their distinct data variation ranges.
Figure 4. Variation in cosine similarity with orientation in static tests when the subject maintained an arms-outstretched pose at a fixed position. The figure includes COS_j, COS_H, COS_fH, COS_H^U, and COS_H^L data. COS_j data are visualized separately for the upper and lower limbs to correspond to their distinct data variation ranges.
Figure 5. Visualization of skeletal data corresponding to various distances when the subject maintained pre-defined poses and faced the Kinect directly. The left panel shows the frontal view, and the right panel shows the top view. Blue lines correspond to the ground-truth data X_v, Y_v, Z_v obtained from the Visualeyez, red lines correspond to the data X̃_k, Ỹ_k, Z̃_k obtained from the Kinect, and black lines correspond to the corrected values X_f, Y_f, Z_f.
Figure 6. Visualization of skeletal data from a top-down perspective when the subject maintained an arms-outstretched pose while varying their orientation and position. From top to bottom, the subject's orientation changes from −45° to 45°. The line colors have the same meaning as in Figure 5: blue lines correspond to the ground-truth data X_v, Y_v, Z_v, red lines to the raw values X̃_k, Ỹ_k, Z̃_k, and black lines to the corrected values X_f, Y_f, Z_f.
Figure 7. Values and standard deviations of X̃_k, Ỹ_k, Z̃_k when the subject maintained an arms-outstretched pose. The left panel shows the results for an orientation deviation of 30°, and the right panel shows the results for an orientation deviation of 45°. The values of x̃_k^i, ỹ_k^i, and z̃_k^i are represented by blue, green, and red lines, respectively, and their standard deviations by shaded areas of the corresponding colors.
Figure 8. Cosine similarity results of dynamic tests, where (a) corresponds to the forward and backward walking test, (b) to the marching in place test, (c) to the lateral walking test, and (d) to the in-place body rotation test. The data frame index corresponding to the data sampling time is used as the horizontal axis, uniformly arranged to facilitate retrieval of data records. In total, 500 frames correspond to a sampling duration of approximately 17 s. The figure includes COS_H, COS_fH, and COS_j data, with COS_j data presented separately for the upper and lower limbs for comparison with the static test results.
Table 1. Parameter settings for static experiments.
Pose | Orientation (°) | Distance Range (mm) | Distance Interval (mm) | Number of Sub-Tests
Standing at attention | 0 | [1500, 4000] | 100 | 26
Standing at attention | ±30 | [1500, 4000] | 500 | 12
Standing at attention | ±45 | [1500, 4000] | 500 | 12
Arms swinging (left forward) | 0 | [1500, 4000] | 100 | 26
Arms swinging (right forward) | 0 | [1500, 4000] | 100 | 26
Arms outstretched laterally ¹ | 0 | [1500, 4000] | 100 | 26
Arms outstretched laterally ¹ | ±15 | [2000, 4000] | 500 | 10
Arms outstretched laterally ¹ | ±30 | [2000, 4000] | 500 | 10
Arms outstretched laterally ¹ | ±45 | [2000, 4000] | 500 | 10
¹ For tests involving the arms outstretched laterally pose with orientations not directly facing the Kinect, the minimum distance was set to 2000 mm to prevent the wrists from exceeding the recognition range.
Table 2. Parameter settings for dynamic experiments.
Action ¹ | Orientation (°) | Distance Range (mm) | Distance Interval (mm) | Number of Sub-Tests
Marching in place | 0 | [1500, 3500] | 500 | 5
Forward and backward walking | 0 | [1500, 3500] | - | 3
Lateral walking | 0 | [1500, 3500] | 500 | 5
In-place body rotation | [−15, 15] | 2000 | - | 1
¹ Each set of experiments was repeated three times for cross-validation and to mitigate potential data anomalies caused by device-related issues.
Table 3. Skeletal weights for whole-body cosine similarity computation.
Skeletal Segment Index (j) | Skeletal Segment Name | Terminating Node (i) | Originating Node (i) | Weight (Pelvis Weight = 1) | Normalized Weight (W_j)
1 | Right Shank | Ankle_Right (1) | Knee_Right (2) | 10 | 0.130
2 | Right Thigh | Knee_Right (2) | Hip_Right (7) | 5 | 0.065
3 | Left Shank | Ankle_Left (3) | Knee_Left (4) | 10 | 0.130
4 | Left Thigh | Knee_Left (4) | Hip_Left (8) | 5 | 0.065
5 | Pelvis | Pelvis (5) | Pelvis (5) | 1 | 0.013
6 | Lumbar Spine | Spine_Naval (6) | Pelvis (5) | 1 | 0.013
7 | Right Hip | Hip_Right (7) | Pelvis (5) | 3 | 0.039
8 | Left Hip | Hip_Left (8) | Pelvis (5) | 3 | 0.039
9 | Left Forearm | Wrist_Left (9) | Elbow_Left (10) | 10 | 0.130
10 | Left Upper Arm | Elbow_Left (10) | Shoulder_Left (16) | 5 | 0.065
11 | Right Forearm | Wrist_Right (11) | Elbow_Right (12) | 10 | 0.130
12 | Right Upper Arm | Elbow_Right (12) | Shoulder_Right (13) | 5 | 0.065
13 | Right Scapula | Shoulder_Right (13) | Neck (14) | 3 | 0.039
14 | Cervical Spine | Neck (14) | Spine_Chest (15) | 2 | 0.025
15 | Thoracic Spine | Spine_Chest (15) | Spine_Naval (6) | 1 | 0.013
16 | Left Scapula | Shoulder_Left (16) | Neck (14) | 3 | 0.039
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
