Peer-Review Record

Intelligibility Improvement of Esophageal Speech Using Sequence-to-Sequence Voice Conversion with Auditory Attention

by Kadria Ezzine 1,2,*, Joseph Di Martino 3 and Mondher Frikha 2
Appl. Sci. 2022, 12(14), 7062; https://doi.org/10.3390/app12147062
Submission received: 15 June 2022 / Revised: 7 July 2022 / Accepted: 11 July 2022 / Published: 13 July 2022
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)

Round 1

Reviewer 1 Report

Key Strengths of the paper:

1. Topic of interest of a broader audience

2. Well-written paper

 

Main Weaknesses of the paper:

1. I do not find novel approaches in the paper.

2. The experimental protocol is not clear.

Comments:

1. The methodology part is not clear, which makes the paper confusing.

2. Recurrent neural networks (RNNs, like LSTM, BLSTM, GRU, etc.) are also called sequence-to-sequence (Seq2Seq) models. It would be worthwhile for the authors to explain clearly what the differences are in the architecture of the proposed model.

3. The background section should focus on very recent works.

4. The experimental results are lacking in rigour, and many results are subjective.

 

5. Some references are missing; for example, regarding the time-domain envelope based on the cepstral function, it would be good to cite this paper: Al-Radhi, M.S., Csapó, T.G., & Németh, G. (2017). Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis. Proc. Interspeech 2017, 434–438. DOI: 10.21437/Interspeech.2017-678

Author Response

Responses to Reviewer 1 Comments

Point 1: The methodology part is not clear, which makes the paper confusing.

Response 1: Thank you very much for your comment. We revised Subsection 2.3 as follows: "We consider a Seq2Seq framework that models the mapping relationship between X^(s) and X^(t). Like most sequence-to-sequence models, which were originally proposed by Cho et al. for machine translation applications [27], our Seq2Seq model has an encoder-decoder structure equipped with an attention mechanism. The encoder network projects the input sequence of esophageal speech into a high-dimensional feature space in order to facilitate the decoding process. It then compresses and summarizes these features into a fixed-length context vector, also called the hidden representation h = [h_1, ..., h_t, ..., h_{N_h}], which represents all the information concerning the source features:

h = Encoder(X^(s); θ_encoder),    (5)

where θ_encoder = [W_encoder, U_encoder, b_encoder] are the parameters of the encoder model.

At each time step i, the decoder takes the outputs of the encoder, i.e., the hidden representations h_t, and the previously generated features y_{i-1} as inputs, applies its attention mechanism network, and then progressively generates the current enhanced output y_i:

y_i = Decoder(h_t, y_{i-1}; θ_decoder),    (6)

Note that, in the typical Seq2Seq framework, the last hidden layer is used as the context vector once the entire sequence is fully encoded. However, in our case, the length of the ES feature sequence is generally longer than that of the normal speech, so the ES feature vectors of length N_x used to obtain the target feature vectors of length N_y change dynamically at different time steps.

To model these various nonlinear mapping relationships more accurately, we adopt an attention mechanism (detailed in the figure) to obtain a self-adaptive context vector that adaptively estimates the decoder output."

We also added Figure 4 in Subsection 2.3.2, which illustrates the attention technique used, to help the reader understand the architecture of the model more deeply.
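[Editor's illustration] To make Eqs. (5)–(6) concrete, here is a minimal, hypothetical PyTorch sketch of an attention-equipped encoder-decoder of this general kind. The layer sizes, names, and single-layer LSTM choice are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class Seq2SeqVC(nn.Module):
    """Hypothetical sketch of an encoder-decoder VC model with attention."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        # Encoder: maps the ES feature sequence X^(s) to hidden states h (Eq. 5).
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder cell: consumes the previous output y_{i-1} plus an attention
        # context over h, and emits the current enhanced frame y_i (Eq. 6).
        self.decoder_cell = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim * 2, 1)  # content-based scoring
        self.out_proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x_src, max_len):
        h, _ = self.encoder(x_src)             # (B, N_x, H): hidden representation h
        s = h.new_zeros(h.size(0), h.size(2))  # decoder hidden state
        c = torch.zeros_like(s)                # decoder cell state
        y_prev = h.new_zeros(h.size(0), self.out_proj.out_features)
        outputs = []
        for _ in range(max_len):               # N_y may differ from N_x
            # Self-adaptive context vector: score every h_t against the
            # current decoder state, softmax into attention weights alpha.
            scores = self.attn_score(
                torch.cat([h, s.unsqueeze(1).expand_as(h)], dim=-1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=1)
            context = (alpha.unsqueeze(-1) * h).sum(dim=1)
            s, c = self.decoder_cell(torch.cat([y_prev, context], dim=-1), (s, c))
            y_prev = self.out_proj(s)
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)     # (B, N_y, feat_dim)
```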

 

Point 2: Recurrent neural networks (RNNs, like LSTM, BLSTM, GRU, etc.) are also called sequence-to-sequence (Seq2Seq) models. It would be worthwhile for the authors to explain clearly what the differences are in the architecture of the proposed model.

Response 2: Thank you for your nice question. Our proposed architecture, consisting of a sequence-to-sequence matching model with an attention mechanism, is shown in Figure 1. It uses the encoder-decoder architecture, as well as the content-based attention mechanism already presented in Section 2.3.1. This architecture consists of a stacked LSTM encoder, similar to that used in the simple LSTM-based framework, followed by an attentional LSTM decoder.

First, we use the encoder-decoder architecture because it is more powerful than a simple recurrent neural network (RNN). RNNs like LSTM, BLSTM, and GRU cannot be used (in their standard configuration) for sequence transformation tasks. Encoder-decoder architectures, on the other hand, are conditional autoregressive models, i.e., they generate a sequence element by element while conditioning on another sequence. Second, for our proposed model, we added the attention mechanism, which allows the system to focus on specific parts of the input sequence and leads to more efficient extraction of relevant information. For training the VC model, we used the Seq2Seq model with attention as well as the simple RNN-based models (LSTM and BLSTM), and we observed a remarkable improvement in performance with Seq2Seq with attention. We have also added more explanation of the different architectures implemented in Subsection 3.2.
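[Editor's illustration] As a sketch of this difference, in the same assumed PyTorch style as above: a standard frame-wise BLSTM is locked to one output frame per input frame, whereas an encoder-decoder like the hypothetical Seq2SeqVC sketch can emit a target sequence of a different length.

```python
import torch
import torch.nn as nn

class FramewiseBLSTM(nn.Module):
    """Baseline sketch: a standard BLSTM whose output length equals its input length."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, bidirectional=True,
                             batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):        # x: (B, N, feat_dim)
        h, _ = self.blstm(x)     # (B, N, 2H): one hidden vector per input frame
        return self.proj(h)      # (B, N, feat_dim): same N as the input

x = torch.randn(2, 120, 40)       # 120 ES frames (toy data)
y_baseline = FramewiseBLSTM()(x)  # necessarily 120 output frames
# An encoder-decoder such as the Seq2SeqVC sketch above could instead be
# called as Seq2SeqVC()(x, max_len=90) to emit 90 target frames, letting
# attention absorb the ES/normal-speech length mismatch.
```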

 

Point 3: The background section should focus on very recent works.

Response 3: Thanks for your nice reminder. We added the following recent works to the background section:

 

  • Raman, S., Sarasola, X., Navas, E., & Hernaez, I. (2021). Enrichment of oesophageal speech: Voice conversion with duration-matched synthetic speech as target. Applied Sciences, 11(13), 5940.

  • Alers, T. J., Fennema, B. A., & van Breukelen, J. J. (2020). Tracheo-esophageal speech enhancement: Real-time pitch shift and output.

  • Tanaka, K., Kameoka, H., Kaneko, T., & Hojo, N. (2019). AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6805–6809). IEEE.

 

Point 4: The experimental results are lacking in rigour, and many results are subjective.

Response 4: Thank you very much for your nice reminder. To compare the voice quality performance of the proposed speech enhancement methods and the baseline methods, we adopted four objective measures in the temporal, frequency, and perceptual domains. These objective measures are presented in Subsection 3.3.
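[Editor's illustration] The four measures themselves are defined in Subsection 3.3 of the paper. Purely as an example of what an objective spectral-domain measure for voice conversion looks like, here is a minimal sketch of mel-cepstral distortion (MCD), a widely used metric; whether MCD is among the four measures actually used is an assumption here.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """MCD in dB between two time-aligned mel-cepstral sequences of shape (N, D).

    The 0th coefficient (frame energy) is conventionally excluded.
    """
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    # Standard constant: (10 / ln 10) * sqrt(2), applied per frame, then averaged.
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))

ref = np.random.randn(100, 25)   # hypothetical reference mel-cepstra
conv = np.random.randn(100, 25)  # hypothetical converted mel-cepstra
print(f"MCD: {mel_cepstral_distortion(ref, conv):.2f} dB")
```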

Point 5: Some references are missing; for example, regarding the time-domain envelope based on the cepstral function, it would be good to cite this paper: Al-Radhi, M.S., Csapó, T.G., & Németh, G. (2017). Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis. Proc. Interspeech 2017, 434–438. DOI: 10.21437/Interspeech.2017-678

Response 5: Thanks for your nice reminder. We have included the above citation in the references.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes a voice conversion (VC) system for esophageal speech (ES), based on a sequence-to-sequence (Seq2Seq) model with auditory attention.  It does not require the classical DTW alignment process during learning.

 

The paper is very well done, both in research and in writing.  The algorithm has many technical details.  The research is worthwhile and useful.

 

Specific points:

 

..train laryngectomees patients to re-speak .. ->

..train laryngectomee patients to speak ..

 

..liberal speech therapist .. - why say "liberal"?

 

Figure 1 is not useful: 1) the time scale (5 s) is far too long to see any detail, 2) why bother displaying pitch contours at all?, 3) are the two displays supposed to be the same sentence? It is hard for the reader to line up anything.

 

..where [10] and [11] have been carried out .. ->

([10] and [11])

 

..based on VC technique .. ->

..based on the VC technique ..

 

..that can adaptively characterize .. ->

..that it can adaptively characterize ..

 

.. Seq2Seq model applied ..

.. Seq2Seq model is applied ..

 

..e propose to predict ..

..we propose to predict ..

 

..consists in applying ..

..consists of applying ..

 

..where N define the frame ..

..where N defines the frame ..

 

Do not indent or capitalize the immediate text after eqs. 1-2

 

..utterances, where, Ns and .. ->

..utterances, where Ns and ..

 

..(i.e generally ..

..(i.e., generally ..

 

..control the data flow. An ..

..control the data flow: an ..

 

..at tth time step ..

..at the tth time step ..

 

..is formulated as follow: ..

..is formulated as follows: ..

 

..where, xt is the input ..- no indent; also, delete comma; same after eqs. 8 and 11 and 13 and 15 too

 

..i.e, different ES phonemes ..

..i.e., different ES phonemes ..

 

..αij is the attention weights ..

..αij  are the attention weights ..

 

Lines 184 and 193: according to the formula ??.  - correct this

 

..are linearly projected to produce..

..is linearly projected to produce..

Line 238 appears to repeat information in line 235

 

..initialized at 103.  - use the exponent correctly

 

..loss does not refine for 10 epochs.  - what does “refine” mean?

 

..Though, our proposed .. ->

..However, our proposed ..

 

..based on BiLSTM encoder-decoder network ..

..based on a BiLSTM encoder-decoder network ..

 

..Table 3 list the ..

..Table 3 lists the ..

 

..Table 3. Perfermance comparaison ..

..Table 3. Performance comparison ..

 

..DNN with our mothod ..

..DNN with our method ..

 

..presents VC system for ..

..presents a VC system for ..

 

Why are some years in boldface in the references?

 

ref. 22: years are different (also: no need to state the year twice)

Author Response

Responses to Reviewer 2 Comments

General Comment: This paper proposes a voice conversion (VC) system for esophageal speech (ES), based on a sequence-to-sequence (Seq2Seq) model with auditory attention.  It does not require the classical DTW alignment process during learning.

The paper is very well done, both in research and in writing.  The algorithm has many technical details. The research is worthwhile and useful.

Response: Thank you very much for agreeing with us on the intent of this manuscript. We have carefully read your comments and done our best to address them one by one. We hope that the manuscript has been improved after this revision.

Please see the attachment.

 

Specific points:

Point 1: ...train laryngectomees patients to re-speak … -> ..train laryngectomee patients to speak ..

Response 1: Thank you for your nice reminder. Revised accordingly.

Point 2: ...liberal speech therapist... - why say "liberal"?

Response 2: Thank you very much for your nice reminder. We revised as “speech therapist”.

Point 3: Figure 1 is not useful: 1) the time scale (5 s) is far too long to see any detail, 2) why bother displaying pitch contours at all?, 3) are the two displays supposed to be the same sentence? It is hard for the reader to line up anything.

Response 3: Thank you very much for your reminder. Since you found Figure 1 not useful, we decided to remove it. In addition, we replaced the following sentence, "We can observe that the pitch is chaotic and the ES is characterized by a much lower Harmonics-to-Noise Ratio (HNR) than laryngeal speech; therefore, the analysis and extraction of F0 is quite difficult, if not impossible. In addition, it is clear that ES presents low intensity and high specific noises in all frequency bands, resulting in a degradation of naturalness and audio quality.", with this sentence: "Compared to normal speech, esophageal speech is characterized by poor intelligibility and poor quality due to a chaotic fundamental frequency, specific noises resembling belching, and low intensity."

Point 4: …where [10] and [11] have been carried out ... -> ([10] and [11])

Response 4: Thank you for your reminder. Revised accordingly.

Point 5: …based on VC technique ... -> …based on the VC technique...

Response 5: Thank you for your reminder. Revised accordingly.

Point 6: ...that can adaptively characterize... ->…that it can adaptively characterize

Response 6: Thank you for your reminder. Revised accordingly.

Point 7: …Seq2Seq model applied ... ->… Seq2Seq model is applied...

Response 7: Thank you for your reminder. Revised accordingly.

Point 8: ...e propose to predict... -> ...we propose to predict...

Response 8: Thank you for your reminder. Revised accordingly.

Point 9: …consists in applying... ->…consists of applying...

Response 9: Thank you for your reminder. Revised accordingly.

Point 10: ...where N define the frame... -> ...where N defines the frame...

Response 10: Thank you for your reminder. Revised accordingly.

Point 11: …Do not indent or capitalize the immediate text after eqs. 1-2

Response 11: Thank you for your nice reminder. Revised accordingly after eqs. 1–2.

Point 12: ...utterances, where, Ns and... -> ...utterances, where Ns and...

Response 12: Thank you for your reminder. Revised accordingly.

Point 13: … (i.e generally... ->… (i.e., generally...

Response 13: Thank you for your reminder. Revised accordingly.

Point 14: ...control the data flow. An... -> ...control the data flow: an...

Response 14: Thank you for your reminder. Revised accordingly.

Point 15: ...at tth time step… ->…at the tth time step…

Response 15: Thank you for your reminder. Revised accordingly.

Point 16: ...is formulated as follow: ... ->…is formulated as follows:...

 Response 16: Thank you for your reminder. Revised accordingly.

Point 17: ...where, xt is the input... - no indent; also, delete comma; same after eqs. 8 and 11 and 13 and 15 too

 Response 17: Thank you for your reminder. All equations are revised accordingly.

Point 18: ...i.e, different ES phonemes ... ->...i.e., different ES phonemes...

 Response 18: Thank you for your reminder. Revised accordingly.

Point 19: ...αij is the attention weights... ->…αij   are the attention weights...

 Response 19: Thank you for your reminder. Revised accordingly.

Point 20: Lines 184 and 193: according to the formula??...  - correct this

Response 20: Thank you for your reminder. Revised and corrected accordingly.

Point 21: …are linearly projected to produce… ->…is linearly projected to produce..

Response 21: Thank you for your reminder. Revised and corrected accordingly.

Point 22: Line 238 appears to repeat information in line 235

Response 22: Thank you for this very nice reminder. Revised accordingly; we have deleted the sentence at line 238.

Point 23: ...initialized at 10−3.  - use the exponent correctly

Response 23: Thank you for your nice reminder. Revised and corrected accordingly.

Point 24: …loss does not refine for 10 epochs.  - What does “refine” mean?

Response 24: Thank you for your question. "Refine" in our context means "improve".
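[Editor's illustration] For concreteness, here is a minimal, hypothetical sketch of this stopping rule (patience-based early stopping), assuming "does not refine" means the validation loss fails to improve for 10 consecutive epochs; all names are illustrative placeholders.

```python
import random

def validate(model):
    """Stand-in for the real validation pass (hypothetical)."""
    return random.random()

model, max_epochs = object(), 100   # placeholder model and training budget
best_loss = float("inf")
patience, stall = 10, 0             # stop after 10 epochs without improvement

for epoch in range(max_epochs):
    val_loss = validate(model)
    if val_loss < best_loss:
        best_loss, stall = val_loss, 0   # loss improved: reset the counter
    else:
        stall += 1
        if stall >= patience:            # "does not refine" for 10 epochs
            break
```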

Point 25: ...Though, our proposed... -> ...However, our proposed...

Response 25: Thank you for your nice reminder. Revised accordingly.

Point 26: ...based on BiLSTM encoder-decoder network... ->…based on a BiLSTM encoder-decoder network...

Response 26: Thank you for your nice reminder. Revised accordingly.

Point 27: ...Table 3 list the... -> ...Table 3 lists the...

Response 27: Thank you for your nice reminder. Revised accordingly.

Point 28: ...Table 3. Perfermance comparaison... ->...Table 3. Performance comparison...

Response 28: Thank you for your nice reminder. Revised accordingly.

Point 29: ...DNN with our mothod...-> ...DNN with our method...

Response 29: Thank you for your nice reminder. Revised accordingly.

Point 30: ...presents VC system for...->…presents a VC system for...

 Response 30: Thank you for your nice reminder. Revised accordingly.

Point 31: Why are some years in boldface in the references?

Response 31: Thank you for your very nice comment. In fact, we followed the Applied Sciences style, in which the years in journal article references are set in bold, as seen in several articles published in Applied Sciences.

Point 32: ...ref. 22: years are different (also: no need to state the year twice)

Response 32: Thank you for your nice reminder. Yes, you are right: we put 2008 instead of 2018; this has been revised and corrected accordingly.





Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have satisfactorily replied to the concerns raised and complemented the manuscript accordingly.
