Article
Peer-Review Record

Learning Energy-Based Models in High-Dimensional Spaces with Multiscale Denoising-Score Matching

Entropy 2023, 25(10), 1367; https://doi.org/10.3390/e25101367
by Zengyi Li 1,2,*, Yubei Chen 1,3 and Friedrich T. Sommer 1,4,5
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 8 August 2023 / Revised: 6 September 2023 / Accepted: 18 September 2023 / Published: 22 September 2023
(This article belongs to the Special Issue Deep Generative Modeling: Theory and Applications)

Round 1

Reviewer 1 Report

The manuscript proposes a novel approach to train energy-based models with reasonable computational cost. It is well written, and the methodology is explained clearly. I can imagine the framework being useful in various fields, such as biomedical engineering and additive manufacturing. Therefore, I suggest acceptance of this manuscript.

Some minor questions and suggestions:
1- A quantitative comparison with ML, or a statement explaining how fast this methodology is compared to ML, would be welcome.
2- Section 4, line 224, "We found that...": what is large enough? A quantitative statement would be welcome.
3- The same section, line 226: how do large or small noise levels destabilize the training process?

Author Response

Both reviewers raised questions about the choice of noise scales, especially how to choose the maximum noise scale. Our understanding is that the maximum noise scale needs to be large enough that traveling between different “modes” of the data distribution is easy. For example, for an NxN image with 3 channels, each with values normalized to [0, 1], the maximum L2 distance between two images is sqrt(N*N*3). If we apply Gaussian noise of variance 1 to each pixel of each channel, the average L2 length of the noise vector will also be sqrt(N*N*3), which easily covers the largest possible distance between two images. In our paper, we chose a maximum noise scale of 1.2 to make sure we cover this distance well.
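As a concrete illustration of this argument, the following minimal sketch (the 32x32x3 image size is an illustrative assumption, not a setting taken from the paper) compares the largest possible L2 distance between two normalized images with the typical length of the added Gaussian noise vector at a noise scale of 1.2:

```python
# Minimal sketch: compare the largest possible L2 distance between two
# [0, 1]-normalized N x N x 3 images with the typical length of a Gaussian
# noise vector. The image size (32 x 32 x 3) is an illustrative assumption.
import numpy as np

N = 32
d = N * N * 3                      # dimensionality of one image
max_dist = np.sqrt(d)              # largest possible distance: every entry differs by 1

sigma = 1.2                        # maximum noise scale discussed above
rng = np.random.default_rng(0)
noise = rng.normal(0.0, sigma, size=(10_000, d))
noise_len = np.linalg.norm(noise, axis=1)

print(f"max image distance: {max_dist:.1f}")           # ~55.4
print(f"mean noise length : {noise_len.mean():.1f}")   # ~sigma * sqrt(d), i.e. ~66.5
print(f"min noise length  : {noise_len.min():.1f}")    # stays above max_dist in practice
```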

Reviewer 1:

On page 6, line 222, we stated that MDSM is up to 10x faster in real run time, and we provided detailed run-time data in the footnote. We understand that this important result can easily be missed, so we have added emphasis to the related statement.

 

On your question about how small or large noise destabilizes the training process: more precisely, training with too large a noise scale does not destabilize training, but it also does not improve the results; the reason is the same as for our choice of 1.2 as the maximum noise scale (please see our explanation above). Training with only a very small noise scale can cause numerical instability. We did not investigate this phenomenon further, since we always need to train with larger noise scales to obtain reasonable results, but it could be due to vanishing or exploding gradients.

 

Reviewer 2 Report

see attachment.

Comments for author File: Comments.pdf


Author Response

Both reviewers raised questions about the choice of noise scales, especially how to choose the maximum noise scale. Our understanding is that the maximum noise scale needs to be large enough that traveling between different “modes” of the data distribution is easy. For example, for an NxN image with 3 channels, each with values normalized to [0, 1], the maximum L2 distance between two images is sqrt(N*N*3). If we apply Gaussian noise of variance 1 to each pixel of each channel, the average L2 length of the noise vector will also be sqrt(N*N*3), which easily covers the largest possible distance between two images. In our paper, we chose a maximum noise scale of 1.2 to make sure we cover this distance well.

 

Reviewer 2:

On your question about the performance of our method relative to others: although this method is not the state-of-the-art generative model, at the time of its proposal it was the fastest way to train an energy-based model that can generate high-quality samples. As explained in the text, energy-based models offer some unique advantages not shared by other generative models, so we believe this method has its own value.

 

On your question about the length distribution, we would like to explain it better here: the length distribution of the random vector refers to the length of the added noise vector, if you view the random noise added to the image as a single vector instead of many independent samples. More specifically, for an image of size N*N, we add independent Gaussian noise of variance 1 to each pixel, i.e., N*N noise samples. If we view these N*N samples as one vector of dimension N*N, that vector will with high probability have an L2 length close to sqrt(N*N), since the variance of each sample is 1.
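To make this concentration concrete, here is a minimal numerical sketch (the 32x32 single-channel size is an illustrative assumption, not a value from the paper) showing that the L2 length of a unit-variance Gaussian noise vector in N*N dimensions clusters tightly around sqrt(N*N) = N:

```python
# Minimal sketch: the L2 length of a d-dimensional unit-variance Gaussian
# vector concentrates around sqrt(d). The 32 x 32 size is illustrative only.
import numpy as np

N = 32
d = N * N                                   # one noise sample per pixel
rng = np.random.default_rng(1)
lengths = np.linalg.norm(rng.normal(size=(10_000, d)), axis=1)

print(f"sqrt(d)     : {np.sqrt(d):.2f}")                       # 32.00
print(f"mean length : {lengths.mean():.2f}")                    # ~32.0
print(f"std / mean  : {lengths.std() / lengths.mean():.4f}")    # ~0.02, i.e. tightly concentrated
```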

 

For your question about the proof in lines 372-377: the argument there does not depend on the added noise being Gaussian. For high-dimensional data, the denoised distribution for any noise should be highly concentrated around the clean sample.

 

The statement “energy value at clean data points may not always be well-behaved” in line 290 means that the energy value can take erratic values if evaluated exactly at the clean sample location; the energy value is much more regular if a tiny amount of noise is added to the clean samples. The reason could be that, during training, the model never actually sees a clean sample: every input to the model has some degree of noise added.

 

If the sampling process gets shorter and shorter, the sample quality would certainly degrade eventually. However, we did not investigate in detail whether increasing the number of sampling steps beyond the current setting would improve sample quality further.

 

Your observation on Fig. B.1 is exactly right. Based on our geometrical interpretation, denoising larger noise requires the model to learn longer-range correlations in the data distribution, which explains why a model trained with a larger single noise scale generates samples with more consistent global shape, while a model trained with a smaller noise scale generates random local strokes. We have added further explanation to this figure to aid understanding.

 

Thank you for pointing out the typos; we have performed one more round of proofreading to improve the manuscript further.

Round 2

Reviewer 1 Report

The authors have answered my questions.
