Article
Peer-Review Record

Intelligibility and Listening Effort of Spanish Oesophageal Speech

Appl. Sci. 2019, 9(16), 3233; https://doi.org/10.3390/app9163233
by Sneha Raman 1,*, Luis Serrano 1, Axel Winneke 2, Eva Navas 1,* and Inma Hernaez 1,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 12 July 2019 / Revised: 1 August 2019 / Accepted: 5 August 2019 / Published: 8 August 2019

Round 1

Reviewer 1 Report


This is a useful and well-written article on a suitable topic of interest. The authors are clearly aware of the field and have done suitable work. I have the following minor corrections to suggest:

>… the main components of the speech apparatus, such as the vocal folds, …- this is too general; speech requires much more than vocal cords; so describing VC as “the main components” is inappropriate

…promising option post laryngectomy.  -> …promising option post-laryngectomy. 

…these less intelligible voices perform poorly with machines which are operated by speech input. - poorly expressed; rephrase; voices do not “perform poorly” 

recognition, or ASR performance is lower compared to HSR performance [6],  ->

recognition, or ASR performance, is lower compared to HSR performance [6], 

The main advantage of objective measurements is that easy to replicate and to implement.  ->

The main advantage of objective measurements is that they are easy to replicate and to implement.

Several studies have devised systems for the analysis of pathological speech which also includes … ->

Several studies have devised systems for the analysis of pathological speech, which also includes

A lot of these studies focus on the micro level of words and vowels.  ->

A lot of these studies focus on the micro-level of words and vowels. 

oesophageal voice is a less traversed area of investigation.  ->

oesophageal voice are a less traversed area of investigation.

judgement of effort. ->

 judgment of effort.

speaker (e.g. foreign accent, impaired speech), the listener (e.g. hearing impairment) or …->

speaker (e.g., foreign accent, impaired speech), the listener (e.g., hearing impairment) or …

(this error occurs often…)

The parallel data used for this task contains 100 phonetically balanced … ->

The parallel data used for this task contain 100 phonetically-balanced …


…hevea?, atony -what are these?

Speakers who practised for at least two years after the laryngectomy, qualified as proficient speakers... ->Speakers who practised for at least two years after the laryngectomy qualified as proficient speakers...

follow-up recordings is useful for future research.  ->

follow-up recordings are useful for future research. 

them to ensure they can hear the sound properly. ->

them to ensure they could hear the sound properly.

lexicon of the original ASR system which contained 37, 632 entries. ->

lexicon of the original ASR system, which contained 37, 632 entries. 

Levenshtein distance takes into account the insertions, … ->

The Levenshtein distance takes into account the insertions, …

For healthy speech there is slight difference of around 3 points in the  ->

For healthy speech there is a slight difference of around 3 points in the

Error bars show 95% confidence intervals  ->

Error bars show 95% confidence intervals.

(same omission of a sentence-final period in each figure…)

…for both healthy and ES. -> …for both HS and ES.


…the system is using a unigram language model contributes greatly to this poor performance. - this would not be understood by a non-expert; i.e., explain why such an LM is a problem

…number of speakers in this experiment is small to draw 

-> …number of speakers in this experiment is too small to draw 

 4. Experiment 2: LE for High-Intelligibility ES  - avoid using acronyms in section titles (where there is ample space…)

 Section. 3), we had … ->

 Section 3), we had …


… listeners side in memorising the sentence. ->

… listeners’ side in memorising the sentence.

be excluded as confounding factor for WER. ->

be excluded as a confounding factor for WER.

However, this involves post processing i.e. the task of ->

However, this involves post processing, i.e., the task of

(same as e.g., …)

Intelligibility test was ->

An Intelligibility test was

A 13 point scale from ’Ningún esfuerzo’ (No effort) ->

A 13-point scale from ’Ningún esfuerzo’ (No effort)

… then they proceeded for the intelligibility ->

… then they proceeded to the intelligibility

The audio responses of the sentence recognition task was transcribed … ->

The audio responses of the sentence recognition task were transcribed …

... was 6.457±3.150 and for the healthy speaker it was 1.994±1.611. ->

... were for the ES speaker 6.457±3.150 and for the healthy speaker 1.994±1.611.

Mean Flanker effect score was ->

The Mean Flanker effect score was


Do not have one-sentence paragraphs at the end of section 4.2.3


for both ES an HS can be considered to be very high. ->

for both ES and HS can be considered to be very high.

Finally, the familiarity effect on LE as reported in experiment 1, could mean ->

Finally, the familiarity effect on LE as reported in experiment 1 could mean

That is, is there a point, where, due to familiarity, they find ES as effortful as HS. - this is a question, and thus requires a ? at the end.

intelligibility and self reported LE metrics. ->

intelligibility and self-reported LE metrics.

Author Contributions: “conceptualization, … ->

Author Contributions: conceptualization, …


Many of the References use inconsistent notation, e.g., lacking capital letters in titles, repeating the year, poor cutting-and-pasting.

Examples:

 Clinical linguistics & phonetics  

Should be:

Clinical Linguistics & Phonetics 

ref. 11 lacks a source

ref. 17: … Computing, 2006. CIC’06. 15th International Conference on. IEEE, 2006, … - very poor text; repeats date 3 times

refs. 13, 31, 37 refer to the same conference, but in three different ways.

refs. 38 and 39: same conference, but 39 is sloppy.

Ref. 41-42: no source


Author Response

Response to Reviewer 1 Comments

 

Point 1: Minor corrections

 

Response 1: All typographical, grammatical and punctuation errors were corrected as suggested.

 

Point 2: >… the main components of the speech apparatus, such as the vocal folds, …- this is too general; speech requires much more than vocal cords; so describing VC as “the main components” is inappropriate

 

Response 2: Agreed. Changed as follows: Along with the larynx, the vocal folds are also removed and therefore it leaves the laryngectomee with a disability to speak.

 

Point 3:…these less intelligible voices perform poorly with machines which are operated by speech input. - poorly expressed; rephrase; voices do not “perform poorly”

 

Response 3: Agreed. Changed as follows: Moreover, these less intelligible voices are not well received by machines which are operated by speech input.

 

Point 4:…hevea?, atony -what are these?

 

Response 4: The meanings are now explained in brackets.

 

Point 5: the system is using a unigram language model contributes greatly to this poor performance. - this would not be understood by a non-expert; i.e., explain why such an LM is a problem

 

Response 5: Explanation added (line 216): A unigram language model is a very simple language model and only considers probabilities of single words. A more sophisticated language model would consider the probabilities of single words and also combinations of words (word pairs, word triplets etc.), which can result in better ASR performances.
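
The distinction drawn in Response 5 can be illustrated with a minimal sketch. This is not the language model of the ASR system used in the paper; the toy corpus, the function names and the add-one smoothing are all assumptions made for illustration. A unigram model scores each word independently, so it cannot tell a plausible word order from a scrambled one, whereas a bigram model (word pairs) can.

```python
from collections import Counter
import math

def train_counts(sentences):
    """Count single words and adjacent word pairs; "<s>" marks the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def unigram_logprob(sentence, unigrams):
    """Log-probability under a unigram model: each word is scored on its own."""
    total = sum(unigrams.values())
    vocab = len(unigrams)
    # Add-one smoothing (an illustrative choice) keeps unseen words finite.
    return sum(math.log((unigrams[w] + 1) / (total + vocab))
               for w in sentence.split())

def bigram_logprob(sentence, unigrams, bigrams):
    """Log-probability under a bigram model: each word is scored given its predecessor."""
    vocab = len(unigrams)
    words = ["<s>"] + sentence.split()
    return sum(math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab))
               for prev, w in zip(words, words[1:]))

# Hypothetical toy corpus, not taken from the paper's data.
corpus = ["el perro come", "el gato come", "el perro duerme"]
uni, bi = train_counts(corpus)

ordered, scrambled = "el perro come", "come perro el"
print(unigram_logprob(ordered, uni), unigram_logprob(scrambled, uni))
print(bigram_logprob(ordered, uni, bi), bigram_logprob(scrambled, uni, bi))
```

In this sketch the unigram scores of the two word orders coincide, while the bigram model assigns the higher score to the order seen in the corpus, which is the kind of context a recogniser can exploit to choose between acoustically similar hypotheses.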

 

Point 6:... was 6.457±3.150 and for the healthy speaker it was 1.994±1.611.

... were for the ES speaker 6.457±3.150 and for the healthy speaker 1.994±1.611.

 

Response 6: Rewritten as: Mean LE (from a 13 point scale) for the ES speaker was 6.457±3.150 and for the healthy speaker it was 1.994±1.611.

 

Point 7: Many of the References use inconsistent notation, e.g., lacking capital letters in titles, repeating the year, poor cutting-and-pasting.

 

Response 7: All inconsistencies and errors in the references were corrected as suggested.

 

Many thanks for all the suggestions.

Author Response File: Author Response.docx

Reviewer 2 Report

Introduction:

The hypotheses summed up at the end of the Introduction should be better focussed on the innovative experimental ones, i.e., I suggest that the authors treat the first two hypotheses (WER positively correlated with self-reported LE ratings, and HS is less effortful than ES) as preliminary evidence that comes from the literature and from previous experimental results.

Literature review:

The most relevant references are there (e.g., Lippmann 1997). However, to improve the contribution, I suggest that the authors add some more recent references (section 2.1).

Experiment 1:

Line 146-147: “to avoid priming” -> the identity condition in a priming experiment is only one particular control condition. The priming effect relies on relations (be they morphological or semantic) between inputs. In your case, I would say that by presenting sentences only once, you want to avoid any possible memory effect.

Line 149-150: please add some more details on your selection criteria, e.g., what did you control, and with which method?

Line 156: “phonetically balanced”: how did you perform this balancing? It would be interesting to add this (in the text or in a footnote).

The statistical analyses are clearly defined.

Experiment 2: I can hardly see the added value of it. You refer to EEG data but then state that they are still under analysis and therefore not included in the paper. Why, then, did you mention them?

Cognitive tasks: the task in itself is well thought out and well defined, but I find it outside your focus.

And you yourselves, in the Discussion section, primarily highlight your evidence as reported in Exp. 1.

My general suggestion is to sharpen your goal and focus (as reported in Exp. 1).

Author Response

Response to Reviewer 2 Comments

 

Point 1: Introduction: The hypotheses summed up at the end of the Introduction should be better focussed on the innovative experimental ones, i.e., I suggest that the authors treat the first two hypotheses (WER positively correlated with self-reported LE ratings, and HS is less effortful than ES) as preliminary evidence that comes from the literature and from previous experimental results.

 

Response 1: We have separated out the first two hypotheses as preliminary evidence and grouped the remaining hypotheses by experiment (1 and 2).

 

Point 2: Literature review: The most relevant references are there (e.g., Lippmann 1997). However, to improve the contribution, I suggest that the authors add some more recent references (section 2.1).

Response 2: New references have been added.

Andersen et al. 2017, Sharma et al. 2016, Van Kuyk et al. 2018:

Line 84: ‘Newer approaches of intelligibility measurement can be found in [10], [11] and [12].’

 

Scharenborg, O., 2007:

 Line 95: ‘A more recent review of HSR and ASR methods can be found in [15].’

 

Point 3: Experiment 1: Line 146-147: “to avoid priming” -> the identity condition in a priming experiment is only one particular control condition. The priming effect relies on relations (be they morphological or semantic) between inputs. In your case, I would say that by presenting sentences only once, you want to avoid any possible memory effect.

 

Response 3: Changed as suggested: ‘The sentences were played only once (to avoid any possible memory effect)’

 

Point 4: Line 149-150: please add some more details on your selection criteria, e.g., what did you control, and with which method?

“Phonetically balanced”: how did you perform this balancing? It would be interesting to add this (in the text or in a footnote).

 

Response 4: We performed the selection with the corpusCRT tool. The details have been added.

 

Line 151: ‘The selection of sentences was performed with a greedy algorithm based tool called corpusCRT [35] with the criteria of maximised diphone coverage and a maximum of 15 words per sentence.’
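
The selection criteria quoted above can be sketched as a greedy coverage loop. This is not the corpusCRT implementation, whose internals are not described here; the function names, the candidate sentences and the approximation of diphones by adjacent letter pairs are all assumptions made for illustration. At each step the sketch picks the sentence that adds the most not-yet-covered diphones, after discarding sentences longer than 15 words.

```python
def diphones(sentence):
    """Approximate diphones as adjacent letter pairs within each word.

    Assumption: a real tool would work on a phonetic transcription,
    not on the orthography.
    """
    units = set()
    for word in sentence.lower().split():
        units.update(word[i:i + 2] for i in range(len(word) - 1))
    return units

def greedy_select(candidates, n_select, max_words=15):
    """Greedily pick up to n_select sentences that maximise diphone coverage."""
    # Enforce the length criterion before selection starts.
    pool = [s for s in candidates if len(s.split()) <= max_words]
    covered, chosen = set(), []
    while pool and len(chosen) < n_select:
        # Pick the sentence contributing the most new diphones.
        best = max(pool, key=lambda s: len(diphones(s) - covered))
        if not diphones(best) - covered:
            break  # no remaining sentence adds coverage
        chosen.append(best)
        covered |= diphones(best)
        pool.remove(best)
    return chosen, covered

# Hypothetical candidates; the last one exceeds the 15-word limit.
too_long = " ".join(["xyzw"] * 16)
candidates = ["la casa", "casa", "el perro", too_long]
chosen, covered = greedy_select(candidates, n_select=3)
print(chosen)
```

A greedy loop of this kind does not guarantee the best possible coverage for a fixed number of sentences, but it is simple and cheap to run, which is why it is a common choice for corpus design.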

 

Point 5: Experiment 2: I can hardly see the added value of it. You refer to EEG data but then state that they are still under analysis and therefore not included in the paper. Why, then, did you mention them?

 

Response 5: We mentioned that this was an EEG experiment in order to describe the setup faithfully and not omit any information. Although the EEG recording did not affect the results, some experimental conditions followed from it. For example, because the participants wore an EEG cap, the stimuli were played over a loudspeaker rather than headphones. This has now been added to the description.

 

Line 299: ‘As the participants had an EEG cap on, the stimuli were played on a loudspeaker, and not on headphones.’

 

Point 6: Cognitive tasks: the task in itself is well thought out and well defined, but I find it outside your focus.

And you yourselves, in the Discussion section, primarily highlight your evidence as reported in Exp. 1.

My general suggestion is to sharpen your goal and focus (as reported in Exp. 1).

 

Response 6: The cognitive tasks were an attempt to see whether differences in cognitive abilities had any effect on the intelligibility and LE scores. We have tried to present the focus of the experiments more clearly in the introduction (with separate hypotheses and a preceding paragraph on the focus of the study).

 

Line 55: ‘This study contains two experiments. The first experiment was web-based, and was focussed on getting preliminary intelligibility and LE metrics for our data. We investigated how intelligibility (both ASR and HSR) and LE differs for the two speech types (ES and HS). We also investigated to what extent, intelligibility and LE are correlated. The second experiment (an extension of experiment 1) was conducted in a laboratory setting, which allowed us better control of the experiment environment. The aim of this experiment was to know if more LE is reported for ES even if the intelligibility of ES is close to HS. Additionally, in this experiment, we also investigated if the participants' performance in the speech perception tasks depended on their cognitive abilities.’

 

We hope the added value of this experiment is evident from this.

 

Many thanks for the suggestions.

 

 

Author Response File: Author Response.docx
