Article
Peer-Review Record

Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

by Aitor Álvarez 1,*,‡, Haritz Arzelus 1,‡, Iván G. Torre 1,‡ and Ander González-Docasal 1,2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2022, 12(4), 1889; https://doi.org/10.3390/app12041889
Submission received: 29 December 2021 / Revised: 27 January 2022 / Accepted: 2 February 2022 / Published: 11 February 2022

Round 1

Reviewer 1 Report

The paper explores and evaluates three novel ASR architectures on the Spanish RTVE2020 database. This is an excellent paper with a thorough description and results. I have some minor comments only.

The abstract may include the WER of the three ASR engines.

Fig. 1 could be drawn more comprehensively by showing the dimensions and other details.

Different n-grams are used in the three ASR engines. Does it affect the WER?

RTF: is it averaged or 90th percentile?

Did you try the multistream CNN on GPUs?

Author Response

Firstly, the authors would like to thank Reviewer 1 for the interesting suggestions and contributions, which have led to a significant improvement of the article.

Please find the authors' answers to each suggestion in the following points:

General comment of the reviewer: The paper explores and evaluates three novel ASR architectures on the Spanish RTVE2020 database. This is an excellent paper with a thorough description and results. I have some minor comments only.

General response of the authors: The authors are grateful for the general comment of the reviewer.

- Point 1 of the reviewer: The abstract may include the WER of the three ASR engines.

Response 1 of the authors: Since the three engines are not described in full in the abstract, we decided to include the best WER of the systems previously submitted to the challenge and the new best WER obtained by the novel systems. This information was added to the abstract in the following sentence:

“As a result, the new speech recognition engines clearly outperformed the performance of the initial systems from the previous best WER of 19.27 to the new best of 17.60 achieved by the DNN-HMM based system.”
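For reference, the WER figures quoted above follow the standard word error rate definition: substitutions plus deletions plus insertions divided by the number of reference words. A minimal sketch in Python using the jiwer package; the sentences are invented examples, not transcripts from the RTVE2020 database:

    # Minimal WER illustration; the sentences are made-up examples,
    # not data from the paper.
    from jiwer import wer  # pip install jiwer

    reference = "el tiempo para mañana será soleado en toda la península"
    hypothesis = "el tiempo para mañana sera soleado en toda península"

    # WER = (substitutions + deletions + insertions) / reference words
    print(f"WER = {wer(reference, hypothesis) * 100:.2f}%")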

- Point 2: Fig. 1 could be drawn more comprehensively by showing the dimensions and other details.

Response 2: We have extended the figure and included more information about the architecture. We hope that the new figure is now clearer.

- Point 3: Different n-grams are used in the three ASR engines. Does it affect the WER?

Response 3: Yes, it could. The n-gram LM (a 5-gram) is the same for both E2E systems based on the Quartznet and Wav2vec2 architectures. Nevertheless, we could not use this LM in the DNN-HMM based system because of the size of the resulting graph: we had to reduce the n-gram order from 5 to 3 so that the HCLG graph would fit in memory. However, we added a rescoring step based on a 4-gram RNNLM in order to try to compensate for this possible loss.
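To illustrate how the LM order can matter, the sketch below scores the same hypothesis with a 3-gram and a 5-gram ARPA model using the KenLM Python bindings. The model files and the sentence are placeholders for illustration only, not the actual models used in the paper:

    import kenlm  # KenLM Python bindings

    # Placeholder ARPA files standing in for a 3-gram LM (as in the HCLG graph)
    # and a 5-gram LM (as used with the E2E decoders).
    lm_3g = kenlm.Model("lm_3gram.arpa")
    lm_5g = kenlm.Model("lm_5gram.arpa")

    hypothesis = "el presidente comparece esta tarde ante los medios"  # invented example

    # Higher-order models can score longer fluent word spans more accurately,
    # which is one way the LM order can influence the final WER.
    print("3-gram log10 prob:", lm_3g.score(hypothesis, bos=True, eos=True))
    print("5-gram log10 prob:", lm_5g.score(hypothesis, bos=True, eos=True))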

- Point 4: RTF: is it averaged or 90th percentile?

Response 4: The RTF was computed following the standard definition; that is, the total processing time divided by the total duration of the test set.
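As a quick illustration of this definition (with made-up numbers, not figures from the paper):

    # RTF = total processing time / total audio duration; values below 1.0
    # mean faster than real time. Both numbers are invented for illustration.
    total_processing_seconds = 5400.0   # hypothetical wall-clock decoding time
    total_audio_seconds = 14400.0       # hypothetical test-set duration (4 hours)

    rtf = total_processing_seconds / total_audio_seconds
    print(f"RTF = {rtf:.2f}")  # 0.38 with these made-up numbers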

- Point 5: Did you try the multistream CNN on GPUs?

Response 5: Yes, we did. In fact, we decoded the test set again on GPU for the multistream CNN and CNN-TDNN-F models, and replaced the previous CPU-based figures with these results. This way, all the decodings were performed on a GPU and the decoding times are now more comparable.

Author Response File: Author Response.pdf

Reviewer 2 Report

The article is clearly written. However, I have two main suggestions.

1) The presented methods are not completely new; they represent modifications of existing acoustic modelling methods or ASR systems created by other authors.

2) My main suggestion concerns the selection of the methods or systems that have been evaluated. Nowadays, it is not so usual for the best results with 700+ hours of training data to be achieved by a hybrid HMM/DNN acoustic model; in many cases, state-of-the-art (SOTA) results are yielded by E2E systems.

Both these facts reduce the significance and attractiveness of the paper for the community. The authors should therefore try to evaluate more SOTA E2E systems. A good candidate could be wenet:

https://github.com/wenet-e2e/wenet

In my experience, its basic training and evaluation are very simple and should take just a week or two. The amount of available data (743 hours) should be enough to yield very competitive results, as for many other speech databases (see also wenet/examples/$dataset/s0/readme.md for benchmarks on different speech datasets). Moreover, wenet is also interesting from a practical point of view, as it is capable of operating on a CPU with an RTF below one.

The resulting conclusions would then be more up to date and more interesting to compare with SOTA results on other speech datasets (e.g., see the webpage of the English GigaSpeech database, which also compares various ASR systems and where wenet has yielded the best results so far).

https://github.com/SpeechColab/GigaSpeech

I would also appreciate it if Table 1 could be extended to include information on whether the individual databases are publicly available and under what conditions (or licence).

Author Response

Firstly, we would like to thank Reviewer 2 for the interesting suggestions and contributions, which have led to a significant improvement of the article. Please find the authors' answers to each suggestion in the following points:

General comment of the reviewer: The article is clearly written. However, I have two main suggestions.

- Point 1: The presented methods are not completely new; they represent modifications of existing acoustic modelling methods or ASR systems created by other authors.

Response 1: The authors fully agree with the reviewer’s comment. The point is that this article was intended as an extension of a previously submitted paper which described the ASR engines submitted to the Albayzín Speech-To-Text transcription challenge. Hence, the main goal was not to construct new ASR architectures, but to build the best possible engines to achieve the best position in that challenge. In this article, we tried to improve and evolve the previously submitted systems, focusing on techniques to enhance the same presented architectures (DNN-HMM and Quartznet) and adding a new one (Wav2vec2). As a result, the authors consider that this article represents an interesting benchmark of several ASR architectures for Spanish on a very challenging dataset.

- Point 2: My main suggestion concerns the selection of the methods or systems that have been evaluated. Nowadays, it is not so usual for the best results with 700+ hours of training data to be achieved by a hybrid HMM/DNN acoustic model; in many cases, state-of-the-art (SOTA) results are yielded by E2E systems.

Response 2: From the authors’ experience, these results can depend on the training and evaluation conditions. It is a fact that E2E systems perform better and better and that in many scenarios they already outperform more traditional systems. However, from our point of view, it is not yet clear how much data these systems need in order to outperform traditional systems under any condition. For example, if we look at the results recently published in [1] (page 24, Table 6), we can see that the Kaldi CNN-TDNN engine outperformed E2E systems such as DeepSpeech2, Facebook CNN-ASG and Facebook TDS-S2S.

 

Moreover, the system that achieved the first position in the Albayzín Speech-To-Text challenge [2] was based on the RWTH HMM-DNN engine, although systems based on the E2E principle were also submitted by other participants.

[1] Georgescu, A. L., Pappalardo, A., Cucu, H., & Blott, M. (2021). Performance vs. hardware requirements in state-of-the-art automatic speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1), 1-30.

[2] Jorge, J., Giménez, A., Baquero-Arnal, P., Iranzo-Sánchez, J., Pérez, A., Díaz-Munío, G. V. G., ... & Juan, A. (2021). MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge. Proceedings of the IberSPEECH, Valladolid, Spain, 24-25.

- Point 3: Both these facts reduce the significance and attractiveness of the paper for the community. The authors should therefore try to evaluate more SOTA E2E systems. A good candidate could be wenet:

https://github.com/wenet-e2e/wenet

In my experience, its basic training and evaluation are very simple and should take just a week or two. The amount of available data (743 hours) should be enough to yield very competitive results, as for many other speech databases (see also wenet/examples/$dataset/s0/readme.md for benchmarks on different speech datasets). Moreover, wenet is also interesting from a practical point of view, as it is capable of operating on a CPU with an RTF below one.

The resulting conclusions would then be more up to date and more interesting to compare with SOTA results on other speech datasets (e.g., see the webpage of the English GigaSpeech database, which also compares various ASR systems and where wenet has yielded the best results so far).

https://github.com/SpeechColab/GigaSpeech

Response 3: The authors greatly appreciate this recommendation. It would be really interesting to check the performance of this architecture on the test set. Nevertheless, the authors regret that, given the time and resources currently available, it was not possible to conduct these experiments properly, all the more so because the authors have no previous experience with this ASR toolkit. However, the authors have included this reference in Section 2 of the paper, in the following sentence:

“More recently, new attempts have been focused on building E2E ASR architectures [9], which directly map the input speech signal to character sequences and therefore greatly simplify training, fine-tuning and inference [10–14]” 

Furthermore, following the reviewer’s suggestion, we tried to improve our Wav2vec2 based system by training it for more epochs. Thus, instead of the previous 15 epochs, we fine-tuned the pre-trained model for 30 epochs. Additionally, we fine-tuned the model on in-domain data for 20 additional epochs. This helped the model perform better, reducing the initial WER from 21.62 to 20.68.
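A minimal sketch of what such additional CTC fine-tuning could look like with the Hugging Face Transformers Trainer. The paper does not state which toolkit or hyper-parameters were used, so the checkpoint name, batch size and learning rate below are assumptions for illustration only:

    # Sketch of continued CTC fine-tuning of a wav2vec 2.0 model; the checkpoint,
    # dataset and hyper-parameters are placeholders, not the paper's setup.
    from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

    checkpoint = "facebook/wav2vec2-base-960h"  # public stand-in, not the paper's Spanish model
    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

    training_args = TrainingArguments(
        output_dir="wav2vec2-indomain-finetune",
        num_train_epochs=20,             # the additional in-domain epochs mentioned above
        per_device_train_batch_size=8,   # assumed value
        learning_rate=1e-4,              # assumed value
    )

    # `train_dataset` would hold in-domain audio/transcript pairs prepared with
    # `processor`; its construction is omitted here for brevity.
    train_dataset = None
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    # trainer.train()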

- Point 4: I would also appreciate it if Table 1 could be extended to include information on whether the individual databases are publicly available and under what conditions (or licence).

Response 4: The authors have included the type of license for each dataset in Table 1.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

My comments on the article have been addressed.
