Article
Peer-Review Record

Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target

Appl. Sci. 2021, 11(13), 5940; https://doi.org/10.3390/app11135940
by Sneha Raman *, Xabier Sarasola, Eva Navas and Inma Hernaez *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 9 April 2021 / Revised: 15 June 2021 / Accepted: 18 June 2021 / Published: 26 June 2021
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)

Round 1

Reviewer 1 Report

*** Overall Comments

The paper proposes a new development based on previous methods for Voice Conversion (VC) of Oesophageal Speech, for improved intelligibility and perceptual quality. This is clearly a scientifically interesting application of speech technology, relevant for this journal.

The methods are clearly described and the results seem promising. 

I think the paper is in principle acceptable for publication. I have only a couple of minor requests for clarification or comment.

 

*** Minor Comments

* STOI?  The STOI is a measure of similarity to the reference. In this case the VC system was trained precisely to produce output speech that resembles the synthetic speech (SS) reference. Then the STOI used exactly the SAME training signals as reference. Therefore, it seems to me the STOI, as applied here, is mainly a technical validation measure indicating how well the BLSTM training procedure managed to do the task it was designed to do. It would have been extremely astonishing if this STOI comparison did not show a large improvement. 

Of course, it is a good thing that the STOI showed good results, and it is OK to show the results. But I think it is not quite fair to call it an “objective intelligibility measure” in this case, since we do not have a calibration of the “intelligibility” of the reference SS. It would be fairer to call it a “validation measure” of the BLSTM.

 

* Practical Usage?  How do the authors envision the practical use of this or other similar VC systems? I understand the potential benefit in communication by voice input to an ASR system, but what about the usage in a common person-to-person conversation? In that application, the listener would have to hear every utterance twice: first the OS speech and then the VC version of the same utterance. 

Thus, every spoken sentence would take twice the normal time in practice. How would OS speakers and their conversation partners feel about this problem? The subjective test asked listeners to compare unprocessed OS vs processed VCOS. This is OK, but in order to mimic a real conversation, the evaluation should also compare unprocessed OS vs (OS + VCOS). I am not convinced that there would be a clear preference for (OS + VCOS) in this case.

 

* Subjective Test?  The description seems to suggest that the test was NOT done in the lab but somehow by computer over the web, since participants needed to describe their sound equipment. Please describe briefly how the “test to collect listeners’ opinion” was performed, technically.

Author Response

Please see the attachment

Author Response File: Author Response.docx

Reviewer 2 Report

This paper is worthy of acceptance. The idea appears novel and interesting, and a few small changes would improve the paper.
I have read the article carefully and have highlighted some important issues for the authors to consider. The paper needs to be restructured to be more precise. The Introduction and related work sections give valuable information for readers and researchers. In addition, recent papers should be added to the related work section.
As the work is oriented towards real-time applications, the authors should consider how the outcome of the proposed framework will also meet future requirements.
It would be good if similar domains, such as adversarial examples, were reflected in the related work or in future research on security issues.
[1] Kwon, Hyun, Hyunsoo Yoon, and Ki-Woong Park. "POSTER: Detecting audio adversarial example through audio modification." Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 2019.

[2] Kwon, Hyun, Hyunsoo Yoon, and Ki-Woong Park. "Acoustic-decoy: Detection of adversarial examples through audio modification on speech recognition system." Neurocomputing 417 (2020): 357-370.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

The problem addressed in the manuscript is modern and topical. Oesophageal speech (after a laryngectomy) is one of the major quality-of-life issues for people who have undergone the surgery. The authors present a neural network-based speech conversion solution that improves the quality and intelligibility of the speech signal.

The relevance of the manuscript topic is obvious and indisputable. Neural networks (and other machine learning / artificial intelligence techniques) are among the most widely used and most promising methods to date.

Nevertheless, the technical quality of the manuscript could be improved. The manuscript tends to be purely technological and lacks deeper methodological and technical knowledge. After minor but appropriate corrections, the manuscript could be considered an article of moderate but fair quality.

Some editorial remarks that may help to improve the manuscript:

  • The subjective evaluation of the enriched speech should be mentioned in the abstract.
  • According to the definitions of Percentage Words Correct (PWC, page 4) and Word Error Rate (WER, page 5), the sum of these rates should equal 100% (please confirm or deny this). But analysis of Figures 2-4 shows that this is not the case. The sum of the PWC and WER rates (for the same ASR engine) exceeds 100%, which is incomprehensible. Furthermore, in Figure 3a) the WER is higher than 100%! This suggests a serious error in the evaluation methodology, which should be checked thoroughly.
  • The results in Figures 2-4 should be analyzed more deeply, and some insights should be given. What is the reason for the low results of the Spanish speech recognition engine? Which qualities of speaker 16M3 might have led to the poor WER / PWC results? How do the WER / PWC results correlate with the speaking proficiency of the speakers?
  • The proposed approach for speech signal enrichment is based on an HMM speech synthesis system, vocoders, and a bidirectional long short-term memory neural network. Which element of the proposed approach is most important in terms of quality (STOI measure value)? Which element should be improved to increase the speech enrichment even more?
  • Why was the STOI measure selected for objective evaluation? What are the alternatives for objective evaluation of speech signal intelligibility and quality?
  • Figure 6 (page 7) shows a different STOI score for speaker 02M3 than Figure 7 (page 8).
  • The subjective evaluation included a description of the audio equipment as good, normal, etc. What is the criterion for such a description? Were technical requirements formulated for the equipment? Are quality descriptions given by listeners who are not professional acousticians reliable?
  • The commentary on Figure 8 (pages 8-9) gives a textual description of the graphical results but does not provide any new information or knowledge. Some insights and generalizations are expected in the commentary.
  • Typo on page 9: the results for speaker 04M3 are referenced as Figure 8a, but this should be Figure 8b.
  • The discussion section contains the statement “With a higher intelligibility, the enrichment would be beneficial as observed from results of other speakers”. This statement is speculative and unsubstantiated.
  • The second paragraph of the conclusion section (page 10) repeats statements from the discussion section. The conclusion section should present insights and new knowledge.
  • The conclusion part of the manuscript should present some insights and new knowledge. The current version of the conclusion is just a summary of the results. Please revise and complete the conclusions section.
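
As a point of reference for the PWC and WER remark above, the following minimal sketch computes both rates under the definitions commonly used in ASR evaluation: WER = (S + D + I) / N and PWC = (N - S - D) / N, where S, D, and I are the substitutions, deletions, and insertions of a minimum-edit alignment and N is the number of reference words. These definitions are an assumption here and may differ from those given on pages 4 and 5 of the manuscript; the function names and example sentences are hypothetical. Under these assumed definitions, the two rates sum to 100% only when there are no insertions, and WER can exceed 100%.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, k)          # correct word, no edit
            else:
                diagonal = (e + 1, s + 1, d, k)  # substitution
            e, s, d, k = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, k)
            e, s, d, k = cost[i][j - 1]
            insertion = (e + 1, s, d, k + 1)
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return cost[n][m][1:]


def wer_and_pwc(reference, hypothesis):
    """Return (WER, PWC) as fractions, under the assumed definitions above."""
    ref, hyp = reference.split(), hypothesis.split()
    subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    wer = (subs + dels + ins) / n
    pwc = (n - subs - dels) / n
    return wer, pwc


# Hypothetical example: every reference word is recognised, four words are inserted.
wer, pwc = wer_and_pwc("open the door", "please open the door right now ok")
print(f"WER = {wer:.1%}, PWC = {pwc:.1%}")
```

In this hypothetical example, four insertions against three reference words give a WER of about 133% while PWC stays at 100%. If the manuscript uses these definitions, a sum above 100% would reflect insertion errors rather than a flaw in the evaluation.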

Author Response

Please see the attachment

Author Response File: Author Response.docx

Reviewer 4 Report

Convincing paper, excellent methodology. I suggest you correct the following minor errors:

129 evaluted -> evaluated

226 maybe interpreted -> may be interpreted

249 The proposed system can possibly be improved are by using newer DNN VC technologies

256 in on the rise -> is on the rise

260 in case of the least intelligible speaker -> in the case of …  [‘in case of’ = “if this were to happen”, not the meaning here]

274 smart phone -> smartphone

276 in the form of software plugin -> in the form of a software plugin

Author Response

Please see the attachment

Author Response File: Author Response.docx
