Peer-Review Record

Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images

Algorithms 2020, 13(12), 319; https://doi.org/10.3390/a13120319
by Wang Xi 1, Guillaume Devineau 2, Fabien Moutarde 2,* and Jie Yang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 4 November 2020 / Revised: 22 November 2020 / Accepted: 27 November 2020 / Published: 3 December 2020
(This article belongs to the Special Issue Algorithms for Human Gesture, Activity and Mobility Analysis)

Round 1

Reviewer 1 Report

The paper is coherent and presents an interesting application. There are some improvements that can be made.

Loss function choice: what are the arguments for this choice? Would a different loss function work? If not, why?

The concepts of real and fake images in the case of this particular study must be defined before usage, to particularize them for the problem at hand. This would lead to a more rounded definition of the evaluation procedure (what is compared, and why). Also, the evaluation method must specify whether maximization or minimization of the FID is pursued.

The first sentence of section 4.1 appears to be unfinished.

It is unclear how the visual match of the qualitative analysis was made. Was it a comparison with pre-existing images, or simply an observation of the results in Figure 3? The phrase "visually with the condition label input in Generator" is confusing; it seems to point to the former.

Figure 2 is extracted as is from the referenced paper. Copyright issues must be tended to before usage. It is also unclear how the deconvolution procedure was modified in Section 3.4. Perhaps it would be better to replace the current Figure 2 with one depicting the actual steps of the modified network, so that the new class-conditioned generative model being proposed is clearly presented.

English is generally good. Some minor checks are in order throughout the paper for certain expressions; for instance, "up to our knowledge" is not correct: the phrase is "to the best of our knowledge" or "to the extent of our knowledge." Likewise, "drive" in "we can drive the same conclusion" should be "derive".

Small text is difficult to read on the plots. The images with the skeletal poses over time are very small in some cases, and it is difficult to follow them with the faded-out colors; perhaps some more contrast would help.

Author Response

We thank the reviewer for raising several clarity issues in our initial manuscript.
We have tried our best to make our new version clearer on every one of these points. N.B.: all our modifications can easily be spotted, as we have used the "Track Changes" mode of Word to perform them.

Point 1: Loss function choice: what are the arguments for this choice? Would a different loss function work? If not, why?

Response 1: The loss function we have used is the *standard* loss function used in all types of GAN.
We have made this clearer by reformulating the first sentence of §3.5 (line 228).
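For reference, the standard conditional-GAN minimax objective we refer to is the textbook formulation below (our rendering here, not a quotation from the manuscript), where y denotes the action-class label used as condition:

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x \mid y)\right] \;+\; \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D(G(z \mid y) \mid y)\right)\right]
\]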

 

Point 2: The concepts of real and fake images in the case of this particular study must be defined before usage, to particularize them for the problem at hand. This would lead to a more rounded definition of the evaluation procedure (what is compared, and why). Also, the evaluation method must specify whether maximization or minimization of the FID is pursued.

Response 2: In our study, the real images are the TSSI pseudo-images corresponding to the skeletal sequences from the NTU RGB+D dataset used for training, and the fake images are the pseudo-images generated by our model.
The goal of a generative model is to be able to generate fake data that is as similar as possible (statistically speaking) to the real training data. The respective distributions of fake images and real images should therefore be as close as possible, which implies that minimization of the FID is pursued.
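For completeness, the FID we minimize is the standard Fréchet Inception Distance between Gaussian fits (mean \(\mu\), covariance \(\Sigma\)) of the Inception-network features of real (r) and generated (g) pseudo-images; the formula below is the standard definition, not a reproduction from the manuscript:

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
\]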

We have made all this clearer by reformulation in lines 259 to 262 of §3.7. We have also added a new section 3.6 to explain in detail how generated skeleton sequences are obtained from the generated fake pseudo-images.

 

Point 3: The first sentence of section 4.1 appears to be unfinished.

Response 3: This sentence should be "We first analyze qualitatively the results of our generative model".
The word "model" has been added to the sentence (section 4.1, line 272).

 

Point 4: It is unclear how the visual match of the qualitative analysis was made. Was it a comparison with pre-existing images, or simply an observation of the results in Figure 3? The phrase "visually with the condition label input in Generator" is confusing; it seems to point to the former.

Response 4: To make this clearer, we have added to Figure 3, for comparison, typical examples of a real training sequence for each action class. Actually, our qualitative evaluation is based both on comparing with a real sequence of the action, and on simply observing that the skeleton sequences generated by our model look realistic and smooth, and do correspond to the action class fed into the conditional generator as the "condition label".

Besides adding a typical real sequence for each action class to Figure 3 for comparison, we have also reformulated the first paragraph of section 4.1 (lines 272 to 280).

 

Point 5: Figure 2 is extracted as is from the referenced paper. Copyright issues must be tended to before usage. It is also unclear how the deconvolution procedure was modified in Section 3.4. Perhaps it would be better to replace the current Figure 2 with one depicting the actual steps of the modified network, so that the new class-conditioned generative model being proposed is clearly presented.

Response 5: As suggested, we have chosen to replace the architecture illustrated in Figure 2 with a new one, on which the modification appears as "Upsample+Conv" labels between layers of the generator. As for the details of our modification of the deconvolution procedure into Upsample+Conv, a graphical illustration would not be very clear; they are better understood from the pseudo-code we provide in Appendix A.
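As a generic illustration of this kind of Upsample+Conv substitution (our own sketch in PyTorch; the kernel size, upsampling mode, and channel counts are assumptions, not the authors' Appendix A pseudo-code):

```python
import torch
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Generic 'Upsample + Conv' block: nearest-neighbour upsampling followed by
    an ordinary convolution, used in place of a strided transposed convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.block(x)

# Both blocks double the spatial resolution (here 4x4 -> 8x8); the Upsample+Conv
# variant is often preferred because it avoids the checkerboard artifacts that
# transposed convolutions can produce.
x = torch.randn(1, 256, 4, 4)
deconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)                  # torch.Size([1, 128, 8, 8])
print(UpsampleConv(256, 128)(x).shape)  # torch.Size([1, 128, 8, 8])
```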

 

Point 6: English is generally good. Some minor checks are in order throughout the paper for certain expressions; for instance, "up to our knowledge" is not correct: the phrase is "to the best of our knowledge" or "to the extent of our knowledge." Likewise, "drive" in "we can drive the same conclusion" should be "derive".

Response 6: Thank you for pointing out this expression error and typo. We have corrected these issues (on lines 21 and 303, respectively).

 

Point 7: Small text is difficult to read on the plots. The images with the skeletal poses over time are very small in some cases, and it is difficult to follow them with the faded-out colors; perhaps some more contrast would help.

Response 7: Yes, we have now enlarged the corresponding pictures and added more contrast to them (Figures 3, 4, 8, 9 and 13).

Author Response File: Author Response.docx

Reviewer 2 Report

In this manuscript, the authors present a new model for the generation of skeletons corresponding to human movements. They use a Deep Convolutional Generative Adversarial Network (DC-GAN) and a pseudo-image representation of skeleton sequences, the Tree Structure Skeleton Image. They evaluate the proposed approach qualitatively and quantitatively using the large NTU RGB+D public dataset. Here are my comments and questions:

  1. One of the goals is to achieve the natural variability of gesture execution. The particular actions have variable lengths, so you resize the corresponding pseudo-images to the fixed size of 64 x 64 pixels. So each movement is linearly scaled in the time domain to 64. Could you explain how you set the length of the output skeleton sequences?
  2. Is it possible to compare the skeletons obtained by your approach with those generated by some alternative methods?
  3. The joint numbers in Figures 1c and 1d are hard to read.
  4. Do you use in training all skeletal data available in NTU RGB+D?

Author Response

We thank the reviewer for their valuable questions and comments, and have tried our best to address all of them in the new version of our manuscript (modified using the "Track Changes" mode of Word, so that spotting the modifications is easier).

Point 1: One of the goals is to achieve the natural variability of gesture execution. The particular actions have variable lengths, so you resize the corresponding pseudo-images to the fixed size of 64 x 64 pixels. So each movement is linearly scaled in the time domain to 64. Could you explain how you set the length of the output skeleton sequences?

Response 1: The restoration process is symmetric to the one used to prepare the input data. First, the generated 64x64x3 pseudo-image is resized to 100x49x3 (time, joints in TSSI order, XYZ) using bilinear interpolation. Then, for the joints that are repeated in the TSSI order, we take their average value. For example, the third joint (neck) appears twice in the TSSI order (at the third and fourth positions); we compute the average of these two values and take it as the value of the third joint of the skeleton. In this way, the shape of the data is restored to 100x25x3 (time, joints, XYZ), with which we can visualize the sequences of actions. The choice of 100 as the time duration is made because the original lengths of the sequences in the NTU RGB+D dataset vary around 100 frames. The type of variability that we claim our generative model is able to reproduce is the "style" of action execution rather than its duration.

In the new version of the manuscript, we have added a new section 3.6 to explain in detail how we transform the generated pseudo-images into generated skeleton sequences, which all have 100 timesteps. We have also reformulated §3.3 with more details on how we transform original skeleton sequences of varying lengths into fixed-size TSSI pseudo-images.
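A minimal sketch of this restoration step (our own illustrative code, not the manuscript's; it assumes the pseudo-image values have already been mapped back to 3D coordinates, and the 49-entry TSSI joint order must be supplied by the caller):

```python
import numpy as np
from scipy.ndimage import zoom

def pseudo_image_to_skeleton(img64, tssi_order, n_frames=100, n_joints=25):
    """Convert a generated 64x64x3 TSSI pseudo-image (time, TSSI joint, XYZ)
    back into a skeleton sequence of shape (n_frames, n_joints, 3).

    tssi_order: length-49 list giving, for each pseudo-image column, the index
    (0..24) of the NTU RGB+D joint it contains; joints visited several times by
    the tree traversal appear several times in this list.
    """
    # 1. Bilinear resize from 64x64 to 100x49 along the time and joint axes.
    t_scale = n_frames / img64.shape[0]
    j_scale = len(tssi_order) / img64.shape[1]
    resized = zoom(img64, (t_scale, j_scale, 1), order=1)  # shape (100, 49, 3)

    # 2. Average the columns that refer to the same joint (e.g. the neck,
    #    which appears twice in the TSSI order).
    skeleton = np.zeros((n_frames, n_joints, 3))
    counts = np.zeros(n_joints)
    for col, joint in enumerate(tssi_order):
        skeleton[:, joint, :] += resized[:, col, :]
        counts[joint] += 1
    return skeleton / counts[None, :, None]  # (100, 25, 3), ready to visualize
```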

 

Point 2: Is it possible to compare the skeletons obtained by your approach with those generated by some alternative methods?

Response 2: It is hard to compare either qualitatively or quantitatively.

  • Qualitatively: several published articles report on similar research on the same dataset as ours and provide skeleton figures (for example: Kiasari, M. A.; Moirangthem, D. S.; Lee, M. "Human action generation with generative adversarial networks." abs/1805.10416 (2018)). However, none of them provide their source code, which could have allowed us to perform a pertinent comparison between the skeleton sequences generated by our model and those generated by other methods.
  • Quantitatively: we currently have not found any published research providing FID metrics for a generator on the same dataset. Furthermore, since the FID in our case compares generated TSSI pseudo-images with the TSSI of real skeleton sequences, it would only make sense to compare our FID values with those of another *TSSI-based* generative model, which does not yet exist (to the best of our knowledge).

 

Point 3: The joint numbers in Figures 1c and 1d are hard to read.

Response 3: We have replaced the illustration of Figure 1 with a different one, which is actually clearer and more easily readable.

 

Point 4: Do you use in training all skeletal data available in NTU RGB+D?

Response 4: Yes, we do use for training all 3D skeletal sequences available in the NTU RGB+D dataset, except for some samples that have missing data. This entire remaining dataset is used at each epoch during training.

We have now stated this explicitly in section 3.3 (line 184).

Author Response File: Author Response.docx
