Article
Peer-Review Record

Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

Appl. Sci. 2021, 11(16), 7489; https://doi.org/10.3390/app11167489
by Mohammed Salah Al-Radhi 1,*, Tamás Gábor Csapó 1,2 and Géza Németh 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 16 July 2021 / Revised: 12 August 2021 / Accepted: 12 August 2021 / Published: 15 August 2021

Round 1

Reviewer 1 Report

This paper examines Voice Conversion (VC), i.e., automatic transforming of a voice to the speaking style of a different speaker, keeping linguistic information unchanged. The paper notes that the approach using common parallel data is expensive.  The authors propose a method that allows a non-parallel many-to-many voice conversion using a generative adversarial network (GAN).  They use a sinusoidal model with continuous parameters.

It requires only several minutes of training and no time alignment, and uses the publicly available CSTR VCTK corpus. They claim state-of-the-art results.

 

The work described here likely deserves publication. The authors appear to be fully aware of the field, and pursue valid and relevant experiments. On the other hand, I cite below many poor aspects of the presentation, where readers would have difficulty following what is being done. In addition, evaluating the performance of the task is difficult: judging how well a voice is “converted” to another would appear to be highly subjective, as different listeners are likely to have widely varying opinions. At the least, the authors should make serious efforts to address my many comments below.

 

 

Specific points:

 

..necessary to correct the VC approach. ->

..necessary to modify the VC approach.

 

..that modifies the acoustic feature .. ->

..that modifies acoustic features ..

 

Simply delete: “Over the time, …”

 

..speak different native languages or … - why say “native”? What would be different in non-native?

 

..been prepared to arise .. ->

..been done to use ..

 

..of non-parallel VC system .. ->

..of non-parallel VC systems ..

 

..the characteristics of the speech analysis … is an act .. - rephrase;  in general, the text is very well written, but an occasional phrase like this one is wrong

 

..to model voice signal .. ->

..to model voice signals ..

 

..Since then, we can group .. ->

..We can group ..

(The text seems to like to use superfluous phrases such as Over the time and Since then)

 

..Phase [21]; and .. ->

..Phase [21], and ..

 

..Whereas sinusoidal-based phase model, .. - the text should explain more about this; the preceding discussion gives no idea why sinusoidal may be useful.  The term sinusoidal is used repeatedly in that title and abstract, but not motivated at all in the text so far.  Also, delete “Whereas.”

 

..Sections 5 discuss the ..

..Section 5 discusses the ..

 

..Whereas a one-hot vector having ..

..Then, a one-hot vector having ..

(“Whereas” must introduce a conditional clause, not a main clause)

 

..One of the problems encountered here is that the decoder is over-smoothed .. - this needs more explanation, e.g., why? what is being smoothed?

 

..cycle-consistency and adversarial losses .. - these may need further explanation or motivation;  the text is simply naming these ideas with no explanation

 

..several generators-discriminators pairs .. ->

..several generator-discriminator pairs ..

 

..less expensive sinusoidal architectural as .. ->

..less expensive sinusoidal architecture as ..

 

..to address speech analysis-synthesis system ..->

..to address a speech analysis-synthesis system ..

 

..Frequency, phase, and amplitude are typically the main time-varying components in terms of sinusoidal model to synthesize high quality speech. - the text needs a more detailed discussion of speech spectra, in terms of harmonics, to help the reader understand; simply saying, as here, “Frequency, phase, and amplitude” is too simplistic, as frequency is a dependent parameter, while phase and amplitude are measures.

 

..to model the voiced speech to a sum .. ->

..to model the voiced speech as a sum ..
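[Editor's aside, for background on this and the F0 comment below: in a harmonic sinusoidal model, voiced speech is approximated as a sum of harmonics of a single fundamental. A minimal sketch, not the authors' implementation; all parameter values are invented for illustration.]

```python
import numpy as np

def harmonic_synthesis(f0, amps, phases, fs, dur):
    """Synthesize voiced speech as a sum of harmonics of one F0.

    Harmonic k sits at frequency k*f0; amps/phases give A_k and phi_k.
    There is exactly one fundamental; the multiples are harmonics.
    """
    t = np.arange(int(fs * dur)) / fs
    s = np.zeros_like(t)
    for k, (a, phi) in enumerate(zip(amps, phases), start=1):
        s += a * np.cos(2 * np.pi * k * f0 * t + phi)
    return s

# 100 Hz fundamental with 5 harmonics of decreasing amplitude
sig = harmonic_synthesis(100.0, [1.0, 0.5, 0.25, 0.12, 0.06],
                         [0.0] * 5, fs=16000, dur=0.05)
```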

 

..fundamental frequencies (F0s) - no, a periodic signal has ONE F0; multiples are called harmonics; if audio has multiple voices, then one would have multiple F0s

 

.. limitations of discontinuity in the speech parameters and the complexity of neural vocoders. - this paper repeatedly announces important issues such as these two, without  any current or prior explanation or justification;  you need to explain how discontinuity in the speech parameters is important (and indeed what is the issue here?); also, neural vocoders have not been described - so how can one judge their complexity?

 

Fig. 1 is cited and placed on page 3, but critical items are only defined on the next page.

 

“Some of harmonics…” - does this mean “Sum of harmonics…”

The choice of parameters (F0, MVF, MGC) deserves a more detailed discussion; why choose these 3? MVF, in particular, is not a common term in the field. Also, why choose ref. 28 from among the hundreds of existing F0 estimation algorithms?

 

.. may lead to erroneous tracking whenever the harmonic-to-noise ratio is low. - this could also use more discussion; e.g., what is “low” here?

 

Eq. 1 is rather sloppy; e.g., the text uses t (e.g., IF(t)), while the equation uses tau; also, the harmonic index k does not appear on the right.

 

Eq. 2 shows an incorrect upper limit; i.e., k=1 as lower limit, and then just k as upper?

 

..ω0 denotes the angular frequency of the contF0 at a temporal position. - which temporal position?

 

..new contF0 is reduced more significantly than for the baseline. - how is this better?  Unvoiced speech should show no F0; “reducing” is not helpful.

 

..36 MGCs, .. -why 36?  The text has no discussion at all about MGC.

 

Fig. 3 talks of “frequency bins”; use Hz instead; what is the definition of bin here?

 

Why does the upper limit in eq.4 use a superscript not used in the text?

 

..α is the all-pass factor .. - again, no explanation of this

 

..Whereas the synthetic noise part ?(?), .. - this phrase makes no sense in the context; please avoid completely using the word “Whereas”

 

..from StarGAN approach [33] which .. ->

..from the StarGAN approach [33], which ..

 

..multi-domain image-to-image translation. - again, another idea introduced, with no explanation

 

..between many-to-many speaker conversion. - “between” needs 2 objects.

 

It is preferable to use initial capital letters for the words in titles such as “.. IEEE international conference on acoustics, speech and signal processing ..” and ”IEEE Transactions on speech and audio processing.”

 

 

..?(?, ?). Where ? is .. ->

..?(?, ?), where ? is ..

 

..best decision boundary between the converted and real acoustic features. - what does “between” mean here?

 

..Converted speech can also be clipped (inevitably changes the spectrum of speech signals), .. - indeed, yes; but why would one wish to clip? Clipping almost always degrades audio. 

 

..To stabilize the training procedure, we applied three preservation losses (adversarial, classification, and reconstruction) in the objective function. - again, another important idea is not explained; e.g., what is the purpose of these loss functions? Why use 3? Why these three? How do these affect stabilization, and why is this important?
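[Editor's aside, for readers wanting background: in the original StarGAN formulation (cited as [33] in this paper), three such terms are combined into weighted objectives for the discriminator and generator; the λ weights are hyperparameters, and the paper's exact notation may differ from this sketch.]

```latex
\mathcal{L}_{D} = -\,\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls},
\qquad
\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}
```

Roughly: the adversarial term pushes converted features toward the real-data distribution, the classification term makes converted output identifiable as the target domain (here, the target speaker), and the reconstruction (cycle) term preserves the linguistic content.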

 

..decompose this lose into: .. ->

..decompose this loss into: ..

 

The term “spectral envelope” is used loosely throughout the paper; it seems only to be formally defined at line 296. Earlier, it is related to MGC and the Hilbert envelope, which are different things.

 

Next, both generator and discriminator are iteratively updated in the training time, where one module is being corrected whereas the module parameters of another are optimized. - the paper contains many statements such as this, which leave many technical questions unanswered, as to complete procedure; e.g., what does “in the training time” mean? What is the difference between “corrected” and “optimized”?

 

..in the logarithmic space. - in the text, this is the first time log is mentioned (although used in several equations earlier); why use log space?

 

..network comprises of two ..->

 ..network is comprised of two ..

 

..are splitted into ..->

 ..are split into ..

 

..and compute the corresponding stats (means and standard deviations).

..and the corresponding stats (means and standard deviations) computed.

 

..of StarGAN-VC model [25] that reached ..

..of the StarGAN-VC model [25] that achieved ..

 

..able to convert speaker characteristic, ..

..able to convert speaker characteristics, ..

 

..?(?) is the spectral power magnitudes ..

..?(?) is the spectral power magnitude ..

(Actually, one should use the subscripts…)

 

..a set of example spectrogram ..

..a set of example spectrograms ..

 

..positively x-axis distribution .. ->

..positive x-axis distribution ..

 

..In a negatively x-axis distribution, as …

..In a negative x-axis distribution, as …

 

..on non-parallel many-to-many VC task ..

..on a non-parallel many-to-many VC task ..

 

..roughly required 12 minutes long. ->

..roughly required 12 minutes. 

 

..The fact that the proposed framework ..

..The proposed framework ..

 

..to the system with more complicated discontinues F0. - I do not understand this comment at all; there is no prior discussion of data with “discontinuous F0”

 

..signals with adversarial training network.

..signals with an adversarial training network.

 

..The advantage of the sinusoidal model is capable of performing .. - rephrase; this makes  little sense

 

..substituting to the Griffin-Lim algorithm [41] and the dimensionality of the ?-vectors [42]. - rephrase; this makes  little sense

 

..the redundancy of the short-time Fourier transform (also known as the spectrogram) to recover a speech signal. - this is unclear; 1) spectrograms were used earlier in the paper, and need no definition here, 2) what is this redundancy and how does it work?
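[Editor's aside, to unpack the Griffin-Lim reference: the algorithm exploits the redundancy of overlapping STFT frames to iteratively estimate a phase consistent with a given magnitude spectrogram. A generic sketch using SciPy, not the authors' code; window and iteration settings are illustrative.]

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, fs=16000, nperseg=512):
    """Estimate a signal whose STFT magnitude matches `mag`.

    Starts from random phase; each iteration resynthesizes the signal,
    re-analyzes it, keeps the new phase, and snaps the magnitude back
    to the target. Overlapping frames make this projection converge.
    """
    rng = np.random.default_rng(0)
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(spec, fs=fs, nperseg=nperseg)
        _, _, spec_new = stft(x, fs=fs, nperseg=nperseg)
        spec = mag * np.exp(1j * np.angle(spec_new))
    _, x = istft(spec, fs=fs, nperseg=nperseg)
    return x
```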

 

..An d-vector is a ..

..A d-vector is a ..

 

..in utterance domain, .. - what does this mean?

..reducing mel-cepstral distortion .. - what is this?
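[Editor's aside, for context on this metric (standard definition; the paper may use a slightly different variant): mel-cepstral distortion (MCD) is a dB-scale distance between the mel-cepstral coefficient sequences of converted and target speech, MCD = (10/ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. A small sketch:]

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """Mean mel-cepstral distortion in dB between two MGC sequences.

    c_ref, c_conv: arrays of shape (frames, order); the 0th (energy)
    coefficient is conventionally excluded before calling this.
    """
    diff = np.asarray(c_ref) - np.asarray(c_conv)
    # 10/ln(10) * sqrt(2 * sum of squared differences), averaged over frames
    return np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1)))

# identical sequences give zero distortion
print(mel_cepstral_distortion(np.ones((4, 12)), np.ones((4, 12))))  # → 0.0
```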

While Fig. 5 has some value, it is quite hard to see what one is supposed to look for in this figure, as far as understanding the work in this paper.

Author Response

Please see the attachment.

Note: Line numbers in Word mess up pdf rendering. I tried many solutions, but the results were always the same. So, I would recommend checking the .docx file. However, the PDF file is also good.

Author Response File: Author Response.pdf

Reviewer 2 Report

Row 264
Problem: Too much space before subchapter 4.4 (“Full objective”), so that the subtitle is on one page and its content on the next page.
Suggestion: Bring 4.4 up to row 266.

Row 309
Problem: Too much free space before the table.
Suggestion: Move the table to page 16.

Row 414
Problem: “The listening test roughly required 12 minutes long.”
Suggestion: “The listening test required roughly 12 minutes.” Or: “The listening test lasted for 12 minutes, roughly.” Or: “The listening test was roughly 12 minutes long.”

Row 425
Problem: “produce a voice like to the target speaker”
Suggestion: “produce a voice like that of the target speaker”

 

Author Response

Please see the attachment.

Note: Line numbers in Word mess up pdf rendering. I tried many solutions, but the results were always the same. So, I would recommend checking the article .docx file. However, the PDF file is also good.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Line 54: ..and leads to speech-quality degradation. ->

..and lead to speech-quality degradation. 

 

..unvoiced speech can be represented as a sum of sinusoids with random phases. - this is only true in theory, as the number of sinusoids needed to represent noisy speech is very high; the use of sinusoids is only useful for quasi-periodic signals, not for unvoiced. Noisy speech is excited by random noise with no specific discrete frequency components. Now, if one uses enough sinusoids at various unsynchronized values, one might approximate a perceptually similar sound.
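[Editor's aside: the point can be checked numerically. A sum of many equal-amplitude sinusoids with independent random frequencies and phases tends toward Gaussian-like noise (central limit theorem), whereas a few components remain clearly tonal. A small illustrative sketch; all values are invented.]

```python
import numpy as np

rng = np.random.default_rng(42)
fs, n = 16000, 8192
t = np.arange(n) / fs

def random_phase_sum(n_sines):
    """Sum n_sines equal-amplitude cosines at random frequencies and
    random phases, normalized to unit variance (each cosine has variance 1/2)."""
    freqs = rng.uniform(50, 7000, n_sines)
    phases = rng.uniform(0, 2 * np.pi, n_sines)
    s = np.cos(2 * np.pi * freqs[:, None] * t + phases[:, None]).sum(axis=0)
    return s / np.sqrt(n_sines / 2)

def kurtosis(x):
    """Fourth standardized moment: 3 for a Gaussian, 1.5 for one sinusoid."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

# with few components the signal stays tonal (kurtosis well below 3);
# with many it approaches noise-like statistics (kurtosis near 3)
few, many = random_phase_sum(4), random_phase_sum(500)
```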

 

..two main networks encoder and decoder. ->

..two main networks: encoder and decoder. 

 

..Consequently, VAE fail to .. ->

..Consequently, VAE fails to ..

 

..speech obtained from converted features are ..->

..speech obtained from converted features is ..

 

..(i.e. which all parameters are continuous) ..

..(i.e., in which all parameters are continuous) ..

 

.. new contF0 is reduced more significantly than for the baseline. - of what value is this for unvoiced speech? Unvoiced speech should have no F0.

 

..all-pass factor takes 0.42 .. ->

..all-pass factor that takes 0.42 ..

 

..these losses are; the converted data distribution is closer to .. ->

..these losses are, the closer the converted data distribution is to a ..

 

..This loss is worked to reduce the converted features indistinguishable from the.. ->

..This loss works to render the converted features indistinguishable from the..

 

..distribution of a real speech data ? .. ->

..distribution of real speech data ? ..

 

..are followed by ReLU ..

..are followed by a ReLU ..

 

..speaker info (gender, accent, ..

..speaker information (gender, accent, ..

 

..with more complicated discontinues F0. ->

..with more complicated discontinuous F0.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
