Peer-Review Record

Skeleton Tracking Accuracy and Precision Evaluation of Kinect V1, Kinect V2, and the Azure Kinect

Appl. Sci. 2021, 11(12), 5756; https://doi.org/10.3390/app11125756
by Michal Tölgyessy *, Martin Dekan and Ľuboš Chovanec
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 21 May 2021 / Revised: 11 June 2021 / Accepted: 15 June 2021 / Published: 21 June 2021
(This article belongs to the Special Issue Social Robotics and Human-Robot Interaction (HRI))

Round 1

Reviewer 1 Report

On page 4, lines 111-112: Was the camera calibrated before the experiments? Was the principal point calculated by calibration, or was the center of the image used?

How was the measured point marked? In determining the distance between positions, was the orientation of the plate taken into account?

On page 5, Figure 4: What is the measuring range of the sensor? The experiment was probably limited by the reach of the manipulator, but it would be interesting to test the results at a greater distance. Are the 1500 mm and 4500 mm positions given by the measuring range of the sensor?

On page 9, Figures 9-12: The graphs are illegible; the lines are very close together and the colors are very similar. A better option would be a bar graph with multiple columns (one for each node point) for each position of the figurine.

On page 15, line 246: The greatest instability at the ends of the limbs can be caused by the orientation of these parts of the figurine. There is no information on how specific points on the body are detected. Also, what is the effect of clothing on detection (loose clothing vs. close-fitting clothing)?

Author Response

Response to Reviewer 1

On page 4, lines 111-112: Was the camera calibrated before the experiments?

We used the default camera calibration.

Was the principal point calculated by calibration, or was the center of the image used?

The center of the IR and depth image was used.
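As a minimal sketch of this convention (our illustration, not the authors' code; the 640 x 576 resolution is the Azure Kinect NFOV depth mode, and other sensors/modes would differ):

```python
# Sketch: taking the principal point (cx, cy) as the image center instead
# of a calibrated value. 640 x 576 is the Azure Kinect NFOV depth mode.
width, height = 640, 576
cx, cy = width / 2.0, height / 2.0
print(cx, cy)  # 320.0 288.0
```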

How was the measured point marked?

The measured point was an iron screw placed in the middle of the plate.

In determining the distance between positions, was the orientation of the plate taken into account?

We oriented the plate so that it was perpendicular to the Z axis of each sensor. For more information please see [1].

[1] Tölgyessy, M.; Dekan, M.; Chovanec, Ľ.; Hubinský, P. Evaluation of the Azure Kinect and Its Comparison to Kinect V1 and Kinect V2. Sensors 2021, 21, 413. https://doi.org/10.3390/s21020413

On page 5, Figure 4: What is the measuring range of the sensor? The experiment was probably limited by the reach of the manipulator, but it would be interesting to test the results at a greater distance. Are the 1500 mm and 4500 mm positions given by the measuring range of the sensor?

The 1500 mm position was limited by the sensors; we wanted each sensor to see as many joints as possible. The final position was limited by the robotic manipulator. Only the Azure Kinect would give at least some data beyond this position, but the manipulator could not reach any further.

On page 9, Figures 9-12: The graphs are illegible; the lines are very close together and the colors are very similar. A better option would be a bar graph with multiple columns (one for each node point) for each position of the figurine.

The problem is that the Azure Kinect tracks 32 joints; a bar graph would have resulted in 256 bars (an even less clear solution) or 8 separate figures (too much clutter for the whole paper). In the current version of the graphs it can be seen when a joint has noticeably worse results, and the envelope of the graph demonstrates the magnitude and character of the error.

On page 15, line 246: The greatest instability at the ends of the limbs can be caused by the orientation of these parts of the figurine. There is no information on how specific points on the body are detected. Also, what is the effect of clothing on detection (loose clothing vs. close-fitting clothing)?

For joint detection we used binaries developed by Microsoft based on the work of Shotton et al. [1]. Regarding clothing type: the looser the clothing, the worse the detection. We used the tightest clothing possible, because loose clothing only worsens joint detection by hiding the contours of the limbs.

[1] Shotton, J.; Girshick, R.; Fitzgibbon, A.; Sharp, T.; Cook, M.; Finocchio, M.; Moore, R.; Kohli, P.; Criminisi, A.; Kipman, A.; Blake, A. Efficient Human Pose Estimation from Single Depth Images. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2821–2840.

Reviewer 2 Report

Dear Authors,

Based on the first-round review of the manuscript entitled "Skeleton Tracking Accuracy and Precision Evaluation of Kinect V1, Kinect V2, and the Azure Kinect", the reviewer has the following comments:

  1. Please explain the contribution clearly in the introduction.
  2. In the last paragraph of the introduction, the paper organization is needed.
  3. How can the location of the camera be optimized?
  4. Why is the standard deviation (S.D.) used to test the precision?
  5. Please explain Figures 4 and 5 in more detail.
  6. To prove reliability and robustness, which technique is recommended by the authors?
  7. Figures 19-23 need more explanation.
  8. Tables 3 and 4 are not clear; please explain them in more detail.

Regards

Author Response

Response to Reviewer 2

Based on the first-round review of the manuscript entitled "Skeleton Tracking Accuracy and Precision Evaluation of Kinect V1, Kinect V2, and the Azure Kinect", the reviewer has the following comments:

  1. Please explain the contribution clearly in the introduction.

It is stated on line 66:

The main novelty of this paper lies in the skeleton tracking comparison of all three Kinect versions and in the use of a human-sized figurine precisely positioned by a robotic manipulator, with a focus on the new Azure Kinect. Evaluation of the new sensor in this regard is very useful for scientists and researchers developing designs and applications where joint tracking without wearable equipment is needed.

  2. In the last paragraph of the introduction, the paper organization is needed.

The paper organization was added.

  3. How can the location of the camera be optimized?

What optimization do you mean? All sensors were positioned as close as possible to the figurine in the first position, so that as many joints as possible were detected. Each sensor was also placed high above the ground, because the figurine was not standing but floating in the air to ensure the same orientation towards the sensor.

  4. Why is the standard deviation (S.D.) used to test the precision?

Based on our experiments, the Kinect noise had a normal distribution; that is why we chose the standard deviation.
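To illustrate the reasoning, a minimal sketch with synthetic data (our illustration, not the authors' code): precision is taken as the sample standard deviation of repeated depth readings of a static target, and a normality test motivates that choice.

```python
# Sketch: precision as the sample standard deviation of repeated depth
# readings, plus a normality check on the (synthetic) noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
depth_mm = rng.normal(loc=2000.0, scale=1.8, size=500)  # 500 readings, in mm

precision_mm = np.std(depth_mm, ddof=1)   # sample standard deviation
_, p_value = stats.normaltest(depth_mm)   # D'Agostino-Pearson normality test

print(f"precision (S.D.): {precision_mm:.2f} mm, normality p-value: {p_value:.3f}")
```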

  5. Please explain Figures 4 and 5 in more detail.

For general accuracy and precision measurements, we mounted a white reflective plate on the end effector of a robotic manipulator (ABB IRB4600), moved the plate along the depth axis of each sensor, and performed depth measurements in 7-8 discrete positions (Figure 3), where the distance between adjacent positions was 400 mm.
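A sketch of how such a protocol can be evaluated (illustrative only; the mean depth values below are hypothetical):

```python
# Sketch: accuracy as the deviation of measured plate displacements from
# the 400 mm ground-truth step of the manipulator (hypothetical data).
import numpy as np

mean_depth_mm = np.array([1500.2, 1900.9, 2301.5, 2702.4, 3103.6, 3504.1, 3905.3])

step_mm = np.diff(mean_depth_mm)   # measured distance between adjacent positions
error_mm = step_mm - 400.0         # ground truth: the manipulator moved 400 mm
print(error_mm)
```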

  6. To prove reliability and robustness, which technique is recommended by the authors?

What reliability and robustness do you mean? In Table 2 we evaluated the robustness of joint detection based on the number of undetected joints; in addition, there is a whole section where we evaluate body joint reliability using ellipsoids (please see the next question).

  7. Figures 19-23 need more explanation.

Figures 19-23 show the figurine body joint data from position 5. The size of each ellipsoid represents the standard deviation of the 3D position of the particular joint, multiplied by 10 for better visual clarity.
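A minimal sketch of that construction (synthetic data, not the authors' code): the semi-axes of an ellipsoid are the per-axis standard deviations of one joint's 3D position, scaled by 10.

```python
# Sketch: 500 synthetic frames of one joint's (x, y, z) position in mm.
import numpy as np

rng = np.random.default_rng(0)
joint_xyz = rng.normal(loc=[100.0, 850.0, 2500.0],
                       scale=[2.0, 3.5, 1.5], size=(500, 3))

center = joint_xyz.mean(axis=0)                    # ellipsoid center
semi_axes = 10.0 * joint_xyz.std(axis=0, ddof=1)   # S.D. x 10 for visual clarity
print(center, semi_axes)
```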

  8. Tables 3 and 4 are not clear; please explain them in more detail.

Table 3 lists the joint distances between individual positions, where the expected result is 400 mm. For each sensor and its respective mode there are three rows: the first row is the average distance change over all joints; the second row is the minimal average distance change of a joint (the joint with the lowest value); the third row is the maximal average distance change of a joint (the joint with the highest value). Table 4 lists the noise values expressed as standard deviations. As in Table 3, we present the average noise over all joints, together with the maximal and minimal noise, for each respective position.
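To make the three-row layout concrete, a sketch with assumed data (not the authors' pipeline):

```python
# Sketch: the three Table 3 rows for one sensor/mode. Matrix rows are
# joints (32 for the Azure Kinect), columns are position-to-position
# transitions; values are measured joint displacements in mm (synthetic).
import numpy as np

rng = np.random.default_rng(0)
joint_step_mm = 400.0 + rng.normal(scale=5.0, size=(32, 7))

per_joint_avg = joint_step_mm.mean(axis=1)   # average distance change per joint
row_avg = per_joint_avg.mean()               # row 1: average over all joints
row_min = per_joint_avg.min()                # row 2: joint with the lowest value
row_max = per_joint_avg.max()                # row 3: joint with the highest value
print(row_avg, row_min, row_max)
```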

Reviewer 3 Report

Please provide snapshots of the figurine detection at different positions. This would make it easier to understand what is meant by "joints not being detected".

The precision or measurement accuracy should be very closely related to the number of measurements. How many measurements were performed in each case? 

Overall, I feel the conclusion that the Azure Kinect NFOV mode is much better than the Kinect V2 is not correct. The authors fail to clearly support this conclusion with evidence. In many cases the Kinect V2 measurement standard deviations are comparable to those of the Azure Kinect, particularly in the figurine measurements. The authors have to do a much better job of quantifying the performance parameters to show beyond any reasonable doubt that the Azure Kinect NFOV mode is better than its predecessors.

Author Response

Response to Reviewer 3

Please provide snapshots of the figurine detection at different positions. This would make it easier to understand what is meant by "joints not being detected".

We have included a figure to explain this issue.

The precision or measurement accuracy should be very closely related to the number of measurements. How many measurements were performed in each case? 

500 measurements were performed in each case.

Overall, I feel the conclusion that the Azure Kinect NFOV mode is much better than the Kinect V2 is not correct. The authors fail to clearly support this conclusion with evidence. In many cases the Kinect V2 measurement standard deviations are comparable to those of the Azure Kinect, particularly in the figurine measurements. The authors have to do a much better job of quantifying the performance parameters to show beyond any reasonable doubt that the Azure Kinect NFOV mode is better than its predecessors.

Yes, the reviewer is right; the measured standard deviations are similar (we have modified our conclusions accordingly and are very thankful for this observation). However, from Figure 18 it is clear that the Azure Kinect NFOV mode is much more stable in terms of accuracy than the Kinect V2. Furthermore, the range of the Azure Kinect is greater; it was the only sensor that could capture the figurine even beyond the reach of the manipulator. In addition, the WFOV mode has a wider field of view than the Kinect V2.

Round 2

Reviewer 2 Report

Dear Authors,

Regarding the second-round review of this manuscript: it can be accepted for further processing.

Regards

Reviewer 3 Report

Thank you for answering my comments.
