Article
Peer-Review Record

A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice

Information 2022, 13(3), 102; https://doi.org/10.3390/info13030102
by Frederik Bous * and Axel Roebel *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 24 January 2022 / Revised: 14 February 2022 / Accepted: 18 February 2022 / Published: 23 February 2022
(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)

Round 1

Reviewer 1 Report

The paper is of good quality. However, some signal processing methods that extract F0 are missing, for example:

  1. "A comparative study of pitch extraction algorithms on a large variety of singing sounds," in ICASSP, 2013.

  2. "Estimation of Fundamental Frequency From Singing Voice using Harmonics of Impulse-like Excitation Source," in INTERSPEECH, 2018.

-- An illustration with a block diagram of the proposed study would give more clarity to the reader.

-- Captions of figures and tables need to be clearer and should contain all the information relevant for understanding.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper is relatively well written. I have only a few comments, as follows:

  • It might be better to provide corresponding F0 ratios when mentioning the transposition and/or F0 error in cents.
  • The statements provided in the experiments focus on comparing the PaN vocoder and the proposed auto-encoder framework. However, to synthesize the waveform, the latter depends on a mel-spectrogram inverter, e.g., a neural vocoder. Therefore, the performance of the proposed system depends on both components. How can it be justified that the PaN vocoder and the auto-encoder framework can be compared apples-to-apples?
  • In the subjective evaluation, what were the exact instructions given to the listeners to evaluate the mean opinion score for quality? This is because the quality can refer only to the audio signal quality or it can also refer to the naturalness in prosody considering that F0 manipulation is performed. For instance, the signal quality may not be that high, but the audio can be perceived as sounded natural as produced by human because of the matching F0 transposition and spectral envelope, otherwise it will sound unnatural.
  • Regarding the last point, it also seems from the provided samples that none of the systems can produce natural speech/singing sounds when the F0 is transposed to lower values. It appears that work remains to be done to also explicitly modify the vocal tract function to match the F0 modification.
  • Lastly, can the concept of the proposed system in this paper be easily applied to other neural network structures, given that the hyperparameters, especially the bottleneck conditions, appear to depend heavily on the network architecture? In other words, if we want to apply this concept to another network structure, which aspects can be adopted seamlessly?
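As background for the first comment above, a transposition expressed in cents and the corresponding F0 ratio are related by cents = 1200 · log2(ratio), so that 1200 cents correspond to one octave (an F0 ratio of 2). A minimal sketch of this conversion (an illustration only, not part of the paper under review):

```python
import math

def ratio_to_cents(ratio: float) -> float:
    """Convert an F0 ratio to a transposition in cents (1200 cents = one octave)."""
    return 1200.0 * math.log2(ratio)

def cents_to_ratio(cents: float) -> float:
    """Convert a transposition in cents back to an F0 ratio."""
    return 2.0 ** (cents / 1200.0)

# Doubling F0 (one octave up) corresponds to +1200 cents:
print(ratio_to_cents(2.0))    # 1200.0
# A one-semitone (100-cent) transposition corresponds to a ratio of about 1.0595:
print(cents_to_ratio(100.0))
```

For example, an F0 error of 100 cents means the estimated F0 is off by a factor of about 1.06, i.e., one semitone.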

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
