Illumination and Shadows in Head Rotation: Experiments with Denoising Diffusion Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Title: Head Rotation with Denoising Diffusion Models
This study explores the capabilities of Denoising Diffusion Models (DDM) in deep generative modeling, specifically targeting face rotation. The research leverages a recent embedding technique for Denoising Diffusion Implicit Models (DDIM) to achieve significant face rotations of ±30 degrees while preserving individual characteristics. The approach does not require additional training of the generative model; it simply computes and follows specific trajectories in the pre-trained model's latent space, demonstrating its potential and limitations through qualitative examples. The study is straightforward; however, the following comments should be addressed before this manuscript can be considered for publication.
1. The authors should update the title of this study to correctly reflect the content of the manuscript. “Head Rotation with Denoising Diffusion Models” neither highlights the innovative use of DDMs for face rotation nor captures the method's key aspects and the essence of the research.
2. The methodology in Section 6, particularly the linear regression approach for computing trajectories in the latent space, needs more detailed explanation. First, the authors should justify the adoption of linear regression for computing trajectories. In addition, Figure 8 should be improved with X- and Y-axis labels.
3. Although the authors state that a technical comparison is provided in Section 8 (line 468), no comparative evaluation against other state-of-the-art face rotation methods actually appears in that section.
4. The authors claim that the traditional metrics used in generative modeling cannot be applied in this context (line 589), yet the study still lacks a robust quantitative analysis. The authors should either justify this omission or provide metrics and benchmarks to evaluate the performance of the proposed method.
5. The conclusion of the study needs enhancement to better reflect its findings and implications. It should clearly summarize the key outcomes, emphasizing the effective use of DDM for face rotation and the achieved results. The conclusion should also discuss the broader implications for deep generative modeling, including potential applications and advancements. Finally, a detailed discussion of future research directions and improvements should be included to suggest practical steps for further exploration.
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper addresses head/face rotation by leveraging denoising diffusion models. Experimental results on an existing dataset demonstrate the performance of the proposed method.
1. The differences between the proposed architecture and U-Net should be explained in more detail. It would also be helpful to summarize the novelty point by point at the end of the introduction.
2. In Figure 2, it is suggested to use some real images, instead of only blocks, to illustrate the framework.
3. The experimental section lacks comparisons with state-of-the-art models.
4. Some encoder-decoder-based methods from other image processing tasks should be reviewed, such as the binocular-rivalry-oriented predictive autoencoding network for blind stereoscopic image quality measurement.
5. Please improve the presentation quality of the paper. For example, it would be better to unify the terminology, e.g., “head rotation” versus “face rotation”.
Comments on the Quality of English Language
N/A
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the authors' response. However, my comments have not been addressed well.
As the authors state, the novelty does not lie in the network structure; the novelty is therefore minor in my opinion. The motivations should be clearer: why can the network solve this issue? Why use the pretrained model? Moreover, there is no summary of the main contributions in the introduction.
By “real images” I mean that actual images could replace the boxes. For example, in Figure 2 the noisy image is shown as just a green box rather than a real noisy image, which makes the work less readable. The noise variance is also drawn in green, which seems incorrect.
Considering that the work lacks comparisons with the state of the art, the results are less convincing to some extent. Note that you do not need to use the other encoder-decoder networks for training or generation; they are suggested only for the literature review.
Comments on the Quality of English Language
N/A
Author Response
Comment 1: As the authors state, the novelty does not lie in the network structure; the novelty is therefore minor in my opinion. The motivations should be clearer: why can the network solve this issue? Why use the pretrained model? Moreover, there is no summary of the main contributions in the introduction.
Answer: The work is about exploring trajectories in the latent space of generative models, which is a major topic in representation learning. The exploration of trajectories presupposes working with a predefined model that shapes the structure of the latent space. The fact that rotation can be addressed without requiring ad-hoc models or networks is in fact one of the interesting points of the work. We have tried to better explain these points in the introduction.
Comment 2: By “real images” I mean that actual images could replace the boxes. For example, in Figure 2 the noisy image is shown as just a green box rather than a real noisy image, which makes the work less readable. The noise variance is also drawn in green, which seems incorrect.
Answer: We replaced the noisy image and the corresponding output with actual examples. The noise variance is drawn in green because it is an input to the model: the network needs to know how much noise it is supposed to remove.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the response. However, the third comment has not been addressed.
Comments on the Quality of English Language
N/A
Author Response
Comment: Considering that the work lacks comparisons with the state of the art, the results are less convincing to some extent. Note that you do not need to use the other encoder-decoder networks for training or generation; they are suggested only for the literature review.
Answer: Thank you for your remark. We agree with the reviewer that the lack of comparison reduces the relevance of the work; we have already explained the reasons preventing us from addressing this point. We do not understand the reference to the “other” encoder-decoder networks: we described the architecture of the only network used in our experiments.