Illumination and Shadows in Head Rotation: Experiments with Denoising Diffusion Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Title: Head Rotation with Denoising Diffusion Models
This study explores the capabilities of Denoising Diffusion Models (DDM) in deep generative modeling, specifically targeting face rotation. The research leverages a recent embedding technique for Denoising Diffusion Implicit Models (DDIM) to achieve significant face rotations of ±30 degrees while preserving individual characteristics. The approach does not require additional training of the generative model; it simply computes and follows specific trajectories in the pre-trained model's latent space, demonstrating its potential and limitations through qualitative examples. The study is straightforward; however, the following comments should be addressed before this manuscript can be considered for publication.
1. The authors should update the title of this study to correctly reflect the content of the manuscript. “Head Rotation with Denoising Diffusion Models” neither highlights the innovative use of DDMs for face rotation nor captures the method's key aspects and the essence of the research.
2. The methodology in Section 6, particularly the linear regression approach for computing trajectories in the latent space, needs more detailed explanation. First, the authors should justify the adoption of linear regression for computing trajectories. In addition, Figure 8 should be improved with X- and Y-axis labels.
3. Although the authors state that a technical comparison is provided in Section 8 (line 468), no comparative evaluation against other state-of-the-art face rotation methods actually appears in that section.
4. The authors claim that the traditional metrics used in generative modeling cannot be applied in this context (line 589), yet the study still lacks a robust quantitative analysis. The authors should either justify this omission or provide metrics and benchmarks to evaluate the performance of the proposed method.
5. The conclusion of the study needs enhancement to better reflect its findings and implications. It should clearly summarize the key outcomes, emphasizing the effective use of DDM for face rotation and the achieved results. The conclusion should also discuss the broader implications for deep generative modeling, including potential applications and advancements. Finally, a detailed discussion of future research directions and improvements should be included to suggest practical steps for further exploration.
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper addresses head/face rotation by leveraging denoising diffusion models. Experimental results on an existing dataset demonstrate the performance of the proposed method.
1. The differences between the proposed architecture and U-Net should be explained in more detail. It would also be helpful to summarize the novelty point by point at the end of the introduction.
2. In Figure 2, it is suggested to use some real images, instead of only blocks, to illustrate the framework.
3. The experimental section lacks comparisons with state-of-the-art models.
4. Some encoder-decoder-based methods from other image processing tasks should be reviewed, such as the binocular-rivalry-oriented predictive autoencoding network for blind stereoscopic image quality measurement.
5. Please improve the presentation quality of the paper. For example, it would be better to unify the terminology, e.g., “head rotation” versus “face rotation”.
Comments on the Quality of English Language
N/A
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the authors' response. However, my comments have not been addressed well.
As the authors state, the novelty does not lie in the network structure; the novelty is therefore minor in my opinion. The motivations should be clearer: why can the network solve this issue? Why use the pretrained model? Moreover, there is no summary of the main contributions in the introduction.
By “real images” I mean that actual images could replace the boxes. For example, in Figure 2 the noisy image is shown as just a green box rather than a real noisy image, which makes the work less readable. The noise variance is also drawn in green, which seems incorrect.
Considering that the work lacks comparisons with the state of the art, the results are less convincing to some extent. Note that you do not need to use the other encoder-decoder networks for training or generation; they are suggested only for the literature review.
Comments on the Quality of English Language
N/A
Author Response
Comment 1: As the authors state, the novelty does not lie in the network structure; the novelty is therefore minor in my opinion. The motivations should be clearer: why can the network solve this issue? Why use the pretrained model? Moreover, there is no summary of the main contributions in the introduction.
Answer: The work is about exploring trajectories in the latent space of generative models, which is a major topic in representation learning. The exploration of trajectories presupposes working with a predefined model that shapes the structure of the latent space. The fact that rotation can be addressed without requiring ad-hoc models or networks is in fact one of the interesting points of the work. We have tried to better explain these points in the introduction.
Comment 2: By “real images” I mean that actual images could replace the boxes. For example, in Figure 2 the noisy image is shown as just a green box rather than a real noisy image, which makes the work less readable. The noise variance is also drawn in green, which seems incorrect.
Answer: We replaced the noisy image and the corresponding output with actual examples. The noise variance is drawn in green because it is an input to the model: the network needs to know how much noise it is supposed to remove.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the response. However, the third comment has not been addressed.
Comments on the Quality of English Language
N/A
Author Response
Comment: Considering that the work lacks comparisons with the state of the art, the results are less convincing to some extent. Note that you do not need to use the other encoder-decoder networks for training or generation; they are suggested only for the literature review.
Answer: Thank you for your remark. We agree with the reviewer that the lack of comparison reduces the relevance of the work; we have already explained the reasons preventing us from addressing this point. We do not understand the reference to the “other” encoder-decoder networks: we described the architecture of the only network used in our experiments.