Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition
Abstract
1. Introduction
- We propose a novel unsupervised SVR learning method for robust ASR performance in practical applications.
- We demonstrate that incorporating a noise masking strategy into various combinations of the time–frequency regions of spectral features makes the SVR extractor more robust, and that speech recognition with our proposed method significantly outperforms existing methods under real conditions (a minimal sketch of this masking follows this list).
- We report ASR performance for SVR extractors pretrained on real speech datasets of varying sizes (1k–18k h), including performance on speech datasets that were not used for SVR pretraining. For an accurate comparison, we report speech recognition performance under four different conditions, which experimentally shows that our unsupervised masking method is effective.
- To the best of our knowledge, this is the first attempt to pretrain an SVR model on large-scale real Korean speech. We further provide a wide range of ablation studies and analyses of practical ASR results using various masking combinations of the time–frequency regions and two noise masking techniques.
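As a rough illustration of the masking described above, a TERA-style perturbation of a batch of mel filterbank features might look like the following PyTorch sketch. The function name, mask widths, default ratios, and the Gaussian perturbation are our assumptions for illustration, not the paper's exact recipe.

```python
import torch

def mask_time_freq(features: torch.Tensor,
                   time_ratio: float = 0.15,
                   freq_ratio: float = 0.2,
                   noise_prob: float = 0.1) -> torch.Tensor:
    """Illustrative TERA-style masking of a (batch, time, mel) tensor.

    Hyperparameter names and defaults are placeholders; the paper ablates
    its own time/frequency/noise ratios in Section 5.
    """
    x = features.clone()
    batch, n_time, n_mel = x.shape

    # Time masking: zero a contiguous block covering ~time_ratio of the frames.
    t_width = max(1, int(n_time * time_ratio))
    t_start = torch.randint(0, n_time - t_width + 1, (1,)).item()
    x[:, t_start:t_start + t_width, :] = 0.0

    # Frequency masking: zero a contiguous block of mel bins.
    f_width = max(1, int(n_mel * freq_ratio))
    f_start = torch.randint(0, n_mel - f_width + 1, (1,)).item()
    x[:, :, f_start:f_start + f_width] = 0.0

    # Noise masking: with probability noise_prob per frame, add Gaussian
    # noise (one of several possible perturbations; assumed here).
    noisy_frames = torch.rand(batch, n_time, 1) < noise_prob
    x = torch.where(noisy_frames, x + 0.1 * torch.randn_like(x), x)
    return x
```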
2. Related Work
3. Method
3.1. SVR Learning with Masking Method
3.1.1. Masking Time–Frequency Feature
3.1.2. Noise Masking Techniques
3.2. SVR Architecture
3.3. ASR Architecture
4. Experimental Setup
4.1. Data
- Baseline condition: The ASR model is trained directly on mel filterbank features computed from the waveform (no pretraining).
- Matched condition: The same dataset is used both to pretrain the SVR model and to train the ASR model on the features extracted from it.
- Unmatched condition: The dataset used to pretrain the SVR model differs from the dataset used to train the ASR model on the extracted features.
- Multi condition: The SVR model is pretrained on additional speech datasets beyond those used for ASR training; that is, the SVR model sees more data than the ASR model (the pairings are illustrated in the sketch after this list).
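To make the four conditions concrete, the pairing of pretraining data and ASR-training data can be written down directly. The corpus names below are our shorthand for the datasets in the data tables, with KsponSpeech [40] as the ASR corpus, matching the first results table.

```python
# Hypothetical condition-to-data mapping; "ksponspeech" and "kforeignword"
# are our shorthand for the corpora cited as [40] and [41].
conditions = {
    "baseline":  {"svr_pretrain": None,                    # no pretraining
                  "asr_train": "ksponspeech"},
    "matched":   {"svr_pretrain": ["ksponspeech"],         # SVR1K
                  "asr_train": "ksponspeech"},
    "unmatched": {"svr_pretrain": ["kforeignword"],        # SVR3K
                  "asr_train": "ksponspeech"},
    "multi":     {"svr_pretrain": ["ksponspeech", "kforeignword"],  # SVR4K
                  "asr_train": "ksponspeech"},
}
```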
4.2. Pretraining Details
4.2.1. Masking
4.2.2. Optimization
4.3. ASR Training Details
4.4. Software Details
5. Experimental Results and Analysis
5.1. Main Results
5.2. Ablation: Impact of Time Masking Hyperparameter
5.3. Ablation: Impact of Frequency Masking Hyperparameter
5.4. Ablation: Impact of Two Noise Masking Hyperparameters to Perturb the Speech
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. arXiv 2022, arXiv:2212.04356.
2. Garofolo, J.; Lamel, L.; Fisher, W.; Fiscus, J.; Pallett, D.; Dahlgren, N. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM; Linguistic Data Consortium: Philadelphia, PA, USA, 1993.
3. Garofolo, J.; Graff, D.; Paul, D.; Pallett, D. CSR-I (WSJ0) Complete LDC93S6A; Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 1993.
4. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210.
5. Kahn, J.; Rivière, M.; Zheng, W.; Kharitonov, E.; Xu, Q.; Mazaré, P.E.; Karadayi, J.; Liptchinsky, V.; Collobert, R.; Fuegen, C.; et al. Libri-light: A benchmark for ASR with limited or no supervision. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7669–7673.
6. Gong, Y. Speech recognition in noisy environments: A survey. Speech Commun. 1995, 16, 261–291.
7. Rajnoha, J.; Pollák, P. ASR systems in noisy environment: Analysis and solutions for increasing noise robustness. Radioengineering 2011, 20, 74–84.
8. Li, J.; Deng, L.; Gong, Y.; Haeb-Umbach, R. An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 745–777.
9. Potamianos, A.; Narayanan, S.; Lee, S. Automatic speech recognition for children. In Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, 22–25 September 1997.
10. Potamianos, A.; Narayanan, S. Robust recognition of children's speech. IEEE Trans. Speech Audio Process. 2003, 11, 603–616.
11. Wilpon, J.G.; Jacobsen, C.N. A study of speech recognition for children and the elderly. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, USA, 9 May 1996; Volume 1, pp. 349–352.
12. Anderson, S.; Liberman, N.; Bernstein, E.; Foster, S.; Cate, E.; Levin, B.; Hudson, R. Recognition of elderly speech and voice-driven document retrieval. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP99), Phoenix, AZ, USA, 15–19 March 1999; Volume 1, pp. 145–148.
13. Kim, J.W.; Yoon, H.; Jung, H.Y. Linguistic-Coupled Age-to-Age Voice Translation to Improve Speech Recognition Performance in Real Environments. IEEE Access 2021, 9, 136476–136486.
14. Shrawankar, U.; Thakare, V. Adverse Conditions and ASR Techniques for Robust Speech User Interface. Int. J. Comput. Sci. Issues 2011, 8, 440.
15. Chavan, K.; Gawande, U. Speech recognition in noisy environment, issues and challenges: A review. In Proceedings of the 2015 International Conference on Soft-Computing and Networks Security (ICSNS), Coimbatore, India, 25–27 February 2015; pp. 1–5.
16. Weintraub, M.; Taussig, K.; Hunicke-Smith, K.; Snodgrass, A. Effect of speaking style on LVCSR performance. In Proceedings of the 4th International Conference on Spoken Language Processing, Philadelphia, PA, USA, 3–6 October 1996; Volume 96, pp. 16–19.
17. Benzeghiba, M.; De Mori, R.; Deroo, O.; Dupont, S.; Erbes, T.; Jouvet, D.; Fissore, L.; Laface, P.; Mertins, A.; Ris, C.; et al. Automatic speech recognition and speech variability: A review. Speech Commun. 2007, 49, 763–786.
18. Young, V.; Mihailidis, A. Difficulties in automatic speech recognition of dysarthric speakers and implications for speech-based applications used by the elderly: A literature review. Assist. Technol. 2010, 22, 99–112.
19. Kim, J.W.; Yoon, H.; Jung, H.Y. Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System. Sensors 2022, 22, 1509.
20. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
21. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 3465–3469.
22. Chung, Y.A.; Hsu, W.N.; Tang, H.; Glass, J. An Unsupervised Autoregressive Model for Speech Representation Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 146–150.
23. Ling, S.; Liu, Y.; Salazar, J.; Kirchhoff, K. Deep contextualized acoustic representations for semi-supervised speech recognition. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6429–6433.
24. Liu, A.T.; Yang, S.W.; Chi, P.H.; Hsu, P.C.; Lee, H.Y. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6419–6423.
25. Wang, W.; Tang, Q.; Livescu, K. Unsupervised pretraining of bidirectional speech encoders via masked reconstruction. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6889–6893.
26. Park, D.S.; Zhang, Y.; Jia, Y.; Han, W.; Chiu, C.C.; Li, B.; Wu, Y.; Le, Q.V. Improved Noisy Student Training for Automatic Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2817–2821.
27. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
28. Zhang, Y.; Qin, J.; Park, D.S.; Han, W.; Chiu, C.C.; Pang, R.; Le, Q.V.; Wu, Y. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. In Proceedings of the NeurIPS 2020 Workshop on Self-Supervised Learning for Speech and Audio Processing, Virtual Conference, 11 December 2020; arXiv:2010.10504.
29. Liu, A.T.; Li, S.W.; Lee, H.Y. TERA: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2351–2366.
30. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
31. Chung, Y.A.; Zhang, Y.; Han, W.; Chiu, C.C.; Qin, J.; Pang, R.; Wu, Y. w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pretraining. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 244–250.
32. Rivière, M.; Dupoux, E. Towards unsupervised learning of speech features in the wild. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 156–163.
33. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186.
34. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
35. Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
36. Chi, P.H.; Chung, P.H.; Wu, T.H.; Hsieh, C.C.; Chen, Y.H.; Li, S.W.; Lee, H.Y. Audio ALBERT: A lite BERT for self-supervised learning of audio representation. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 344–350.
37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
38. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
39. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
40. Bang, J.U.; Yun, S.; Kim, S.H.; Choi, M.Y.; Lee, M.K.; Kim, Y.J.; Kim, D.H.; Park, J.; Lee, Y.J.; Kim, S.H. KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition. Appl. Sci. 2020, 10, 6936.
41. AIHub. Korean Foreign Word Speech. Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=131 (accessed on 6 January 2023).
42. AIHub. Korean Command Speech (Adult). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=96 (accessed on 6 January 2023).
43. AIHub. Korean Free Conversation Speech (Adult). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=109 (accessed on 6 January 2023).
44. AIHub. Korean Command Speech (Child). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=95 (accessed on 6 January 2023).
45. AIHub. Korean Free Conversation Speech (Child). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=108 (accessed on 6 January 2023).
46. AIHub. Korean Command Speech (Elderly). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=94 (accessed on 6 January 2023).
47. AIHub. Korean Free Conversation Speech (Elderly). Available online: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=107 (accessed on 6 January 2023).
48. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
49. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
50. Yang, Y.Y.; Hira, M.; Ni, Z.; Astafurov, A.; Chen, C.; Puhrsch, C.; Pollack, D.; Genzel, D.; Greenberg, D.; Yang, E.Z.; et al. TorchAudio: Building blocks for audio and speech processing. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6982–6986.
51. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
52. Liu, A.H.; Chung, Y.A.; Glass, J. Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3730–3734.
53. Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711.
54. Rao, K.; Sak, H.; Prabhavalkar, R. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 193–199.
55. Chen, X.; Wu, Y.; Wang, Z.; Liu, S.; Li, J. Developing real-time streaming transformer transducer for speech recognition on large-scale dataset. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5904–5908.
56. Kim, K.; Lee, K.; Gowda, D.; Park, J.; Kim, S.; Jin, S.; Lee, Y.Y.; Yeo, J.; Kim, D.; Jung, S.; et al. Attention based on-device streaming speech recognition with large speech corpus. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 956–963.
57. He, Y.; Sainath, T.N.; Prabhavalkar, R.; McGraw, I.; Alvarez, R.; Zhao, D.; Rybach, D.; Kannan, A.; Wu, Y.; Pang, R.; et al. Streaming end-to-end speech recognition for mobile devices. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 April 2019; pp. 6381–6385.
58. Moritz, N.; Hori, T.; Le, J. Streaming automatic speech recognition with the transformer model. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6074–6078.
59. Shi, Y.; Wang, Y.; Wu, C.; Yeh, C.F.; Chan, J.; Zhang, F.; Le, D.; Seltzer, M. Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6783–6787.
Name | Hours | No. Utterances |
---|---|---|
KsponSpeech [40] | 921 | 517,144 |
KForeignWordSpeech [41] | 3155 | 2,484,843 |
KCommandSpeech (Adult) [42] | 1718 | 1,742,211 |
KCommandSpeech (Child) [44] | 2114 | 2,262,551 |
KCommandSpeech (Elderly) [46] | 2410 | 2,137,981 |
KFreeConvSpeech (Adult) [43] | 3186 | 2,235,385 |
KFreeConvSpeech (Child) [45] | 2377 | 2,389,409 |
KFreeConvSpeech (Elderly) [47] | 2511 | 1,145,652 |
Sum | 18,392 | 14,915,176 |
Dataset | File Format | Sampling Frequency | Mode | Average Duration (s) | Max Duration (s) | No. Speakers (Male/Female)
---|---|---|---|---|---|---
[40] | pcm/wav | 16 kHz | Mono | 6.41 | 31.00 | 923/1077
[41] | pcm/wav | 44 and 16 kHz | Mono | 4.57 | 24.96 | 1000/1000
[42] | pcm/wav | 48 kHz | Mono | 3.55 | 21.42 | 1751/1751
[44] | pcm/wav | 48 kHz | Mono | 3.36 | 24.18 | 1500/1500
[46] | pcm/wav | 48 kHz | Mono | 4.06 | 24.90 | 1500/1500
[43] | pcm/wav | 44 and 16 kHz | Mono | 5.13 | 24.20 | 1000/1000
[45] | pcm/wav | 44 and 16 kHz | Mono | 3.58 | 17.32 | 500/500
[47] | pcm/wav | 44 and 16 kHz | Mono | 7.88 | 24.99 | 500/500
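One practical detail implied by the table above is the mix of sampling rates (16, 44, and 48 kHz) across corpora. Assuming all audio is resampled to a common 16 kHz before feature extraction (our assumption; the software stack includes torchaudio [50]), the step could look like this:

```python
import torchaudio
import torchaudio.functional as F

# "utterance.wav" is an illustrative path, not a file from the corpora.
waveform, sample_rate = torchaudio.load("utterance.wav")
if sample_rate != 16000:
    # Bring 44.1/48 kHz audio down to 16 kHz before computing mel
    # filterbank features (assumed target rate, not stated here).
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
```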
Name | Datasets | Hours | No. Utterances |
---|---|---|---|
SVR1K | [40] | 921 | 517,144 |
SVR3K | [41] | 3155 | 2,484,843 |
SVR4K | [40,41] | 4076 | 3,001,987 |
SVR7K | [40,42,44,46] | 7163 | 6,659,887 |
SVR9K | [40,43,45,47] | 8995 | 6,287,590 |
SVR10K | [40,41,42,44,46] | 10,318 | 9,144,730 |
SVR18K | [40,41,42,43,44,45,46,47] | 18,392 | 14,915,176 |
Name | Datasets | Hours | No. Utterances | Training Steps (100 epochs) |
---|---|---|---|---|
SVR1K | [40] | 921 | 517,144 | 808,050 |
SVR3K | [41] | 3155 | 2,484,843 | 3,882,575 |
SVR4K | [40,41] | 4076 | 3,001,987 | 4,690,625 |
SVR7K | [40,42,44,46] | 7163 | 6,659,887 | 10,406,075 |
SVR9K | [40,43,45,47] | 8995 | 6,287,590 | 9,824,375 |
SVR10K | [40,41,42,44,46] | 10,318 | 9,144,730 | 14,288,650 |
SVR18K | [40,41,42,43,44,45,46,47] | 18,392 | 14,915,176 | 23,304,975 |
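The step counts in the table above are consistent with 100 epochs at a global batch size of roughly 64 utterances; the batch size is inferred by us from the listed numbers, not stated in the table. A quick sanity check:

```python
# Compare listed training steps against utterances * epochs / batch size.
# The global batch size of 64 is our inference from the table.
datasets = {
    "SVR1K":  (517_144,    808_050),
    "SVR3K":  (2_484_843,  3_882_575),
    "SVR18K": (14_915_176, 23_304_975),
}
for name, (n_utts, listed_steps) in datasets.items():
    approx_steps = n_utts * 100 / 64
    print(f"{name}: listed={listed_steps:,}, approx={approx_steps:,.0f}")
# SVR1K comes out at ~808,038 vs. 808,050 listed; the small gap is
# plausibly per-epoch rounding or data sharding.
```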
Pretraining Methods | Network | No. Model Params | CER ↓ (%) |
---|---|---|---|
CPC [20] | Recurrent | 12,931,584 | 13.94 |
APC [22] | Recurrent | 9,107,024 | 14.78 |
NPC [52] | Recurrent | 19,380,560 | 13.36 |
Mockingjay [24] | Parallel | 22,226,928 | 16.95 |
AALBERT [36] | Parallel | 7,805,264 | 17.25 |
TERA [29] | Parallel | 21,981,008 | 13.86 |
Ours | Parallel | 21,981,008 | 12.32 |
Conditions | Name | No. Unlabeled Data (h) | CER ↓ (%) | ERR ↑ (%)
---|---|---|---|---
Baseline | - | - | 15.17 | -
Matched | SVR1K | 921 | 12.32 | 18.79
Unmatched | SVR3K | 3155 | 13.18 | 13.10
Multi | SVR4K | 4076 | 12.23 | 19.37
Multi | SVR7K | 7163 | 12.54 | 17.32
Multi | SVR9K | 8995 | 12.09 | 20.31
Multi | SVR10K | 10,318 | 12.38 | 18.38
Multi | SVR18K | 18,392 | 11.72 | 22.77
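In these tables, ERR is the relative character error rate reduction over the baseline: ERR = (CER_baseline − CER) / CER_baseline × 100; this reading reproduces the listed values, e.g., (15.17 − 12.32) / 15.17 ≈ 18.79% for the matched row. A minimal helper (the function name is ours):

```python
def err_percent(cer_baseline: float, cer_system: float) -> float:
    """Relative error rate reduction over the baseline, in percent."""
    return (cer_baseline - cer_system) / cer_baseline * 100.0

print(round(err_percent(15.17, 12.32), 2))  # 18.79, matching the table
```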
Conditions | Name | No. Unlabeled Data (h) | CER ↓ (%) | ERR ↑ (%)
---|---|---|---|---
Baseline | - | - | 4.77 | -
Matched | SVR3K | 3155 | 1.09 | 77.15
Unmatched | SVR1K | 921 | 4.24 | 11.11
Multi | SVR4K | 4076 | 0.98 | 79.45
Multi | SVR10K | 10,318 | 0.95 | 80.08
Multi | SVR18K | 18,392 | 0.90 | 81.13
Time Masking | Frequency Masking | Noise Masking 1 | Noise Masking 2 | CER ↓ (%) | ERR ↑ (%)
---|---|---|---|---|---
0.1 | 0.2 | 0.1 | 0.1 | 13.87 | 8.57
0.15 | 0.2 | 0.1 | 0.1 | 13.86 | 8.64
0.2 | 0.2 | 0.1 | 0.1 | 13.17 | 13.18
0.3 | 0.2 | 0.1 | 0.1 | 12.94 | 14.70
0.4 | 0.2 | 0.1 | 0.1 | 13.45 | 11.33
Time Masking | Frequency Masking | Noise Masking 1 | Noise Masking 2 | CER ↓ (%) | ERR ↑ (%)
---|---|---|---|---|---
0.15 | 0 | 0.1 | 0.1 | 15.31 | −0.92
0.15 | 0.1 | 0.1 | 0.1 | 13.85 | 8.70
0.15 | 0.15 | 0.1 | 0.1 | 13.80 | 9.03
0.15 | 0.2 | 0.1 | 0.1 | 13.86 | 8.64
0.15 | 0.3 | 0.1 | 0.1 | 13.24 | 12.72
0.15 | 0.4 | 0.1 | 0.1 | 12.32 | 18.79
Time Masking | Frequency Masking | Noise Masking 1 | Noise Masking 2 | CER ↓ (%) | ERR ↑ (%) | Avg CER ↓ (%)
---|---|---|---|---|---|---
0.15 | 0.2 | 0.0 | 0.0 | 13.74 | 9.43 | 13.65
0.15 | 0.2 | 0.0 | 0.1 | 13.08 | 13.78 | 13.65
0.15 | 0.2 | 0.0 | 0.2 | 13.23 | 12.79 | 13.65
0.15 | 0.2 | 0.0 | 0.3 | 13.81 | 8.97 | 13.65
0.15 | 0.2 | 0.0 | 0.4 | 14.16 | 6.66 | 13.65
0.15 | 0.2 | 0.0 | 0.5 | 13.90 | 8.37 | 13.65
0.15 | 0.2 | 0.1 | 0.0 | 13.86 | 8.64 | 13.34
0.15 | 0.2 | 0.1 | 0.1 | 12.98 | 14.44 | 13.34
0.15 | 0.2 | 0.1 | 0.2 | 13.13 | 13.45 | 13.34
0.15 | 0.2 | 0.1 | 0.3 | 13.58 | 10.48 | 13.34
0.15 | 0.2 | 0.1 | 0.4 | 13.14 | 13.38 | 13.34
0.15 | 0.2 | 0.1 | 0.5 | 13.36 | 11.93 | 13.34