The XMUSPEECH System for Accented English Automatic Speech Recognition
Abstract
1. Introduction
2. System Structure
2.1. Acoustic Modeling
- TDNN-F: The TDNN-F model consists of the first 11 TDNN-F layers of the Kaldi CHiME-6 recipe (egs/chime6/s5_track1/local/chain/tuning/run_tdnn_1b.sh).
- CNN-TDNNF-Attention: The CNN-TDNNF-Attention model consists of one CNN layer followed by 11 time-delay layers and a time-restricted self-attention layer [21], with a SpecAugment [22] layer applied on top of the architecture to improve robustness. The CNN layer has a 3 × 3 kernel and 64 filters. The 11-layer TDNN-F shares the configuration of the TDNN-F described above, except that the first TDNN layer is replaced by a TDNN-F layer with 1536 nodes, 256 bottleneck nodes, and no time stride. The attention block has eight heads; the value and key dimensions are 128 and 64, respectively; the context width is 10, with the same number of left and right inputs; and the time stride is 3. A PyTorch-style sketch of this front end is given after this list.
- Multistream CNN: We positioned a 5-layer CNN beneath the SpecAugment layer at the top of the network to better accommodate it, followed by an 11-layer multistream CNN [14].
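For concreteness, the following is a minimal PyTorch-style sketch of the CNN-TDNNF-Attention front end described above. The system itself was built from Kaldi nnet3 components, so the class names, the feature dimension, the number of output targets, and the ±10-frame attention mask below are illustrative assumptions rather than the exact configuration; SpecAugment and the semi-orthogonal constraint used when training TDNN-F factors are omitted for brevity.

```python
# Illustrative sketch only: approximates the CNN-TDNNF-Attention structure in PyTorch.
import torch
import torch.nn as nn

class TDNNF(nn.Module):
    """Factorized TDNN block: projection to a 256-dim bottleneck followed by an
    expansion back to 1536, each spanning a time context of `stride` frames."""
    def __init__(self, dim=1536, bottleneck=256, stride=0):
        super().__init__()
        k, d = (2, stride) if stride > 0 else (1, 1)
        self.down = nn.Conv1d(dim, bottleneck, kernel_size=k, dilation=d)
        self.up = nn.Conv1d(bottleneck, dim, kernel_size=k, dilation=d)
        self.post = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(dim))

    def forward(self, x):                                 # x: (batch, dim, time)
        return self.post(self.up(self.down(x)))

class CnnTdnnfAttention(nn.Module):
    def __init__(self, feat_dim=43, num_targets=6000):    # placeholder dimensions
        super().__init__()
        self.cnn = nn.Conv2d(1, 64, kernel_size=3, padding=1)      # 3x3 kernel, 64 filters
        self.proj = nn.Conv1d(64 * feat_dim, 1536, kernel_size=1)  # collapse frequency axis
        # First block without time stride, the remaining ten with time stride 3.
        self.tdnnf = nn.Sequential(TDNNF(stride=0),
                                   *[TDNNF(stride=3) for _ in range(10)])
        # Eight-head self-attention; the time restriction is imposed with a mask.
        self.attn = nn.MultiheadAttention(embed_dim=1536, num_heads=8, batch_first=True)
        self.output = nn.Linear(1536, num_targets)

    def forward(self, feats):                             # feats: (batch, time, feat_dim)
        # SpecAugment would be applied to `feats` during training (omitted here).
        x = self.cnn(feats.unsqueeze(1))                  # (B, 64, T, F)
        x = x.permute(0, 1, 3, 2).flatten(1, 2)           # (B, 64*F, T)
        x = self.tdnnf(self.proj(x)).transpose(1, 2)      # (B, T', 1536)
        idx = torch.arange(x.size(1))
        mask = (idx[:, None] - idx[None, :]).abs() > 10   # restrict attention to +/-10 frames
        x, _ = self.attn(x, x, x, attn_mask=mask)
        return self.output(x)
```

In the actual system, the analogous structure is declared as a Kaldi nnet3 xconfig and trained with the LF-MMI objective rather than written as PyTorch modules.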
2.2. Multistream CNN Architecture
2.3. Accent/Speaker Embeddings
2.4. Neural-Network Alignment
2.5. Language Model Rescoring
3. Experimental Results
3.1. Data Sets and Augmentation
3.2. Effect of Acoustic Model
3.3. Effect of Accent/speaker Embeddings
3.4. Effect of Language Model Rescoring
3.5. Results of Different Countries’ Accents
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Feng, S.; Kudina, O.; Halpern, B.M.; Scharenborg, O. Quantifying bias in automatic speech recognition. arXiv 2021, arXiv:2103.15122.
- Vergyri, D.; Lamel, L.; Gauvain, J.L. Automatic speech recognition of multiple accented English data. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Chiba, Japan, 26–30 September 2010.
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860.
- Gao, Q.; Wu, H.; Sun, Y.; Duan, Y. An End-to-End Speech Accent Recognition Method Based on Hybrid CTC/Attention Transformer ASR. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7253–7257.
- Li, S.; Ouyang, B.; Liao, D.; Xia, S.; Li, L.; Hong, Q. End-To-End Multi-Accent Speech Recognition with Unsupervised Accent Modelling. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6418–6422.
- Chen, Y.C.; Yang, Z.; Yeh, C.F.; Jain, M.; Seltzer, M.L. AIPNet: Generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6979–6983.
- Na, H.J.; Park, J.S. Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks. Appl. Sci. 2021, 11, 8412.
- Tan, T.; Lu, Y.; Ma, R.; Zhu, S.; Guo, J.; Qian, Y. AISpeech-SJTU ASR System for the Accented English Speech Recognition Challenge. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6413–6417.
- Shi, X.; Yu, F.; Lu, Y.; Liang, Y.; Feng, Q.; Wang, D.; Qian, Y.; Xie, L. The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and Methods. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6918–6922.
- Najafian, M.; Russell, M. Automatic accent identification as an analytical tool for accent robust automatic speech recognition. Speech Commun. 2020, 122, 44–55.
- Ahmed, A.; Tangri, P.; Panda, A.; Ramani, D.; Karmakar, S. VFNet: A Convolutional Architecture for Accent Classification. In Proceedings of the 2019 IEEE 16th India Council International Conference (INDICON), Rajkot, India, 13–15 December 2019; pp. 1–4.
- Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 328–339.
- Povey, D.; Cheng, G.; Wang, Y.; Li, K.; Xu, H.; Yarmohammadi, M.; Khudanpur, S. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3743–3747.
- Han, K.J.; Pan, J.; Tadala, V.K.N.; Ma, T.; Povey, D. Multistream CNN for robust acoustic modeling. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6873–6877.
- Han, K.J.; Prieto, R.; Wu, K.; Ma, T. State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention with Dilated 1D Convolutions. arXiv 2019, arXiv:1910.00716.
- Chen, M.; Yang, Z.; Liang, J.; Li, Y.; Liu, W. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer. In Proceedings of the INTERSPEECH 2015, Dresden, Germany, 6–10 September 2015; pp. 3620–3624.
- Turan, M.A.T.; Vincent, E.; Jouvet, D. Achieving multi-accent ASR via unsupervised acoustic model adaptation. In Proceedings of the INTERSPEECH 2020, Shanghai, China, 25–29 October 2020.
- Karafiát, M.; Veselý, K.; Černocký, J.H.; Profant, J.; Nytra, J.; Hlaváček, M.; Pavlíček, T. Analysis of X-Vectors for Low-Resource Speech Recognition. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6998–7002.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011.
- Povey, D.; Peddinti, V.; Galvez, D.; Ghahremani, P.; Manohar, V.; Na, X.; Wang, Y.; Khudanpur, S. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 2751–2755.
- Povey, D.; Hadian, H.; Ghahremani, P.; Li, K.; Khudanpur, S. A Time-Restricted Self-Attention Layer for ASR. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5874–5878.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
- Kenny, P.; Boulianne, G.; Ouellet, P.; Dumouchel, P. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1435–1447.
- Dehak, N. Discriminative and Generative Approaches for Long- and Short-Term Speaker Characteristics Modeling: Application to Speaker Verification. Ph.D. Thesis, École de Technologie Supérieure, Montreal, QC, Canada, 2009.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015.
- Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224.
Layer | Layer Type | Context | Size |
---|---|---|---|
Frame1 | TDNN | {t − 4: t + 4} | 512 |
Frame2 | TDNN | {t − 2, t, t + 2} | 512 |
Frame3 | TDNN | {t − 3, t, t + 3} | 512 |
Frame4 | TDNN | {t} | 512 |
Frame5 | TDNN | {t} | 1500 |
Stat pool | — | {0, T} | 2 × 1500 |
Segment6 | Affine | {0} | 512 |
Segment7 | Affine | {0} | 512 |
Softmax | — | {0} | Num. accents/speakers |
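The extractor tabulated above can be written compactly as a neural-network module; the following is a minimal PyTorch-style sketch. The embeddings in the system were extracted with the Kaldi toolkit, so the input feature dimension and the number of accent/speaker classes below are placeholders, and reading the embedding from the Segment6 affine output follows common x-vector practice rather than a detail stated in the table.

```python
# Illustrative sketch only: the frame-level TDNN layers, statistics pooling, and
# segment-level affine layers summarised in the table above.
import torch
import torch.nn as nn

def tdnn(in_dim, out_dim, kernel, dilation):
    """One TDNN (dilated 1-D convolution) layer with ReLU and batch norm."""
    return nn.Sequential(nn.Conv1d(in_dim, out_dim, kernel, dilation=dilation),
                         nn.ReLU(), nn.BatchNorm1d(out_dim))

class EmbeddingExtractor(nn.Module):
    def __init__(self, feat_dim=30, num_classes=8):       # placeholder dimensions
        super().__init__()
        self.frame_layers = nn.Sequential(
            tdnn(feat_dim, 512, kernel=9, dilation=1),     # Frame1: {t-4 .. t+4}
            tdnn(512, 512, kernel=3, dilation=2),          # Frame2: {t-2, t, t+2}
            tdnn(512, 512, kernel=3, dilation=3),          # Frame3: {t-3, t, t+3}
            tdnn(512, 512, kernel=1, dilation=1),          # Frame4: {t}
            tdnn(512, 1500, kernel=1, dilation=1),         # Frame5: {t}
        )
        self.segment6 = nn.Linear(2 * 1500, 512)           # embedding is read here
        self.segment7 = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(512), nn.Linear(512, 512))
        self.output = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(512), nn.Linear(512, num_classes))

    def forward(self, feats):                              # feats: (batch, feat_dim, time)
        x = self.frame_layers(feats)                       # (batch, 1500, time')
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)   # stats pooling: 2 x 1500
        embedding = self.segment6(stats)                   # 512-dim accent/speaker embedding
        logits = self.output(self.segment7(embedding))     # softmax applied in the loss
        return logits, embedding
```

At recognition time, only the 512-dimensional embedding is kept and appended to the acoustic features; the classification head is used solely to train the extractor on accent or speaker labels.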
System | Features | WER (%) on Dev | WER (%) on Eval |
---|---|---|---|
TDNN-F (baseline) | MFCC | 9.12 | 9.31 |
TDNN-F | MFCC + Pitch | 8.97 | 9.18 |
CNN-TDNNF-Attention | MFCC + Pitch | 8.92 | 9.12 |
Multistream CNN | MFCC + Pitch | 8.86 | 9.08 |
Embeddings | WER (%) on Dev | WER (%) on Eval |
---|---|---|
[M1] w/o embeddings | 8.86 | 9.08 |
[M2] Spk-ivectors | 7.18 | 8.01 |
[M3] Accent-ivectors | 7.17 | 7.95 |
[M4] + spk-ivectors | 7.02 | 8.02 |
[M5] Spk-xvectors | 7.04 | 7.76 |
[M6] Accent-xvectors | 7.02 | 7.89 |
[M7] + spk-xvectors | 6.95 | 7.74 |
System | Features | Embeddings | WER (%) on Dev | WER (%) on Eval |
---|---|---|---|---|
Baseline (TDNN-F) | MFCC | - | 9.12 | 9.31 |
Multistream CNN | MFCC + Pitch | accent-xvectors + spk-xvectors | 6.95 | 7.74 |
Multistream CNN + LM rescoring | MFCC + Pitch | accent-xvectors + spk-xvectors | 5.41 | 5.99 |
Country | WER (%) on Dev | WER (%) on Eval |
---|---|---|
China | 8.10 | 9.29 |
Japan | 4.08 | 4.01 |
India | 6.79 | 7.24 |
USA | 5.82 | 4.29 |
Britain | 2.90 | 3.87 |
Portugal | 4.42 | 4.57 |
Russia | 6.23 | 6.83 |
Korea | 4.86 | 3.99 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).