Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations
Abstract
1. Introduction
- An SSL-based detouring method was explored to overcome the scarcity of labelled speech and song data for recognizing Korean phonemes in speech and singing voices. The method generalizes to phoneme recognition in the speech and singing voices of any other language for which labelled data are insufficient.
- In the proposed model, a HuBERT-type SSL model pre-trained on a large-scale English speech corpus proved useful for capturing general representations of linguistic phonemes, which are adapted to Korean phonemes in a subsequent stage with a relatively small-scale Korean dataset.
- In the speech–singing voice adaptation, melodic supervision through multi-task learning proved useful for better phoneme recognition (a schematic sketch follows this list).
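The pipeline described in these highlights can be pictured with a short sketch. The code below is a minimal illustration, not the authors' implementation: an English-pretrained HuBERT-Base encoder (loaded here via torchaudio) is topped with a linear CTC head over Korean phonemes for the English–Korean stage, plus an optional frame-level pitch head supplying the melodic supervision of the multi-task speech–singing stage. The vocabulary size, pitch-class inventory, and loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchaudio

NUM_KOREAN_PHONEMES = 52   # assumed inventory size, incl. CTC blank at index 0
NUM_PITCH_CLASSES = 129    # assumed: 128 MIDI pitches + an "unvoiced" class

class AdaptedHuBERT(nn.Module):
    """English-pretrained HuBERT encoder with Korean-phoneme and pitch heads."""

    def __init__(self, multi_task: bool = False):
        super().__init__()
        bundle = torchaudio.pipelines.HUBERT_BASE  # SSL model pre-trained on English speech
        self.encoder = bundle.get_model()
        hidden = 768                               # HuBERT-Base hidden size
        self.phoneme_head = nn.Linear(hidden, NUM_KOREAN_PHONEMES)
        self.pitch_head = nn.Linear(hidden, NUM_PITCH_CLASSES) if multi_task else None

    def forward(self, wav, lengths):
        feats, out_lens = self.encoder.extract_features(wav, lengths)
        h = feats[-1]                              # last transformer layer: (B, T, 768)
        phn_logits = self.phoneme_head(h)
        pitch_logits = self.pitch_head(h) if self.pitch_head is not None else None
        return phn_logits, pitch_logits, out_lens

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
ce = nn.CrossEntropyLoss()

def loss_fn(model, wav, lens, phn_tgt, phn_lens, pitch_tgt, alpha=0.5):
    phn_logits, pitch_logits, out_lens = model(wav, lens)
    log_probs = phn_logits.log_softmax(-1).transpose(0, 1)  # (T, B, V) for CTCLoss
    loss = ctc(log_probs, phn_tgt, out_lens, phn_lens)      # phoneme supervision
    if pitch_logits is not None:                            # melodic supervision
        loss = loss + alpha * ce(pitch_logits.flatten(0, 1), pitch_tgt.flatten())
    return loss
```

In the English–Korean stage, either only the linear head is trained or the whole stack is fine-tuned (cf. the fine-tuning comparison table below); the pitch head is added for the speech–singing stage.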
2. Related Work
2.1. Self-Supervised Speech Representation Learning
2.2. Automatic Lyric Transcription
3. Method
3.1. Self-Supervised English Speech Representation Learning
3.2. English–Korean Adaptation
3.3. Speech–Singing Adaptation
4. Experiments
4.1. Data Preparation and Performance Evaluation
4.1.1. Datasets
4.1.2. Preprocessing of the NIKLSEOUL and CSD Corpora
4.1.3. Evaluation Metrics
4.2. Experimental Details
4.2.1. HuBERT-Based SSL and Adaptation Models
4.2.2. Dataset Division
4.2.3. Training Setup
4.3. Experimental Results
4.3.1. English–Korean Adaptation
4.3.2. Speech–Singing Adaptation
5. Conclusions and Further Research
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| | Bilabial | Labiodental | Dental | Alveolar | Palatoalveolar | Palatal | Velar | Labiovelar | Glottal |
|---|---|---|---|---|---|---|---|---|---|
| Common | p m | | | t s n | tʃ | j | k ŋ | w | h |
| English-only | | f v | θ ð | d z l ɹ | ʃ ʒ dʒ | | g | | ʔ |
| Korean-only | | | | ɾ | | | ɯ | | |
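Because Korean surface pronunciation diverges from spelling (assimilation and liaison across syllable boundaries), mapping transcript or lyrics text onto a phoneme inventory like the one above requires a grapheme-to-phoneme step. Below is a minimal sketch using the open-source g2pk package; treating it as the paper's exact preprocessing tool is an assumption.

```python
# pip install g2pk
from g2pk import G2p

g2p = G2p()
# g2pk rewrites Hangul into its surface pronunciation, after which a
# romanization/IPA table can map each jamo to a phoneme symbol.
print(g2p("어제는 날씨가 맑았는데"))  # -> "어제는 날씨가 말간는데"
```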
| Prior | Fine-Tuning Strategy | Phoneme Error Rate (%) ↓ | Syllable Error Rate (%) ↓ | Consonant Error Rate (%) ↓ | Vowel Error Rate (%) ↓ |
|---|---|---|---|---|---|
| None | Full | 38.63 | 73.44 | 32.60 | 29.86 |
| En | Linear Probe | 21.28 | 49.43 | 17.99 | 18.48 |
| En | Full | 7.44 | 19.66 | 5.77 | 6.34 |
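The columns above are sequence error rates; for phonemes this is presumably the standard Levenshtein edit distance (substitutions, insertions, and deletions) between the recognized and reference sequences, normalized by reference length. A minimal sketch under that assumption:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # old d[j] becomes the "diagonal" value for the next column
            prev, d[j] = d[j], min(d[j - 1] + 1,     # insertion
                                   d[j] + 1,         # deletion
                                   prev + (r != h))  # substitution
    return d[-1]

def error_rate_pct(ref, hyp):
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# e.g. error_rate_pct(list("kamsa"), list("kasa")) -> 20.0 (one deletion / 5 symbols)
```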
Error rates (%) on the CSD and NIKLSEOUL corpora, and COn onset metrics (%) on the CSD corpus ("—" = not applicable):

| Model | Phoneme ↓ (CSD) | Syllable ↓ (CSD) | Consonant ↓ (CSD) | Vowel ↓ (CSD) | Phoneme ↓ (NIKLSEOUL) | Syllable ↓ (NIKLSEOUL) | Consonant ↓ (NIKLSEOUL) | Vowel ↓ (NIKLSEOUL) | COn Precision ↑ | COn Recall ↑ | COn F1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A ¹ | 17.64 | 40.93 | 15.71 | 15.13 | 34.74 | 74.96 | 26.94 | 31.39 | — | — | — |
| B ² | 17.76 | 41.08 | 16.36 | 13.69 | 14.30 | 35.80 | 12.09 | 11.17 | 31.60 | 81.28 | 44.93 |
| C ³ | 17.09 | 39.48 | 15.32 | 13.50 | 15.68 | 38.62 | 13.30 | 12.67 | 31.48 | 82.89 | 45.13 |
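COn ("correct onsets") follows the note-level convention for singing-transcription evaluation: an estimated note onset counts as a hit when it falls within a small tolerance of a still-unmatched reference onset, and precision, recall, and F1 are computed over the hits. A minimal sketch; the ±50 ms tolerance and greedy matching are assumptions about the exact protocol:

```python
def onset_prf(ref_onsets, est_onsets, tol=0.05):
    """Precision/recall/F1 (%) for onsets matched one-to-one within ±tol seconds."""
    ref = sorted(ref_onsets)
    used = [False] * len(ref)
    hits = 0
    for t in sorted(est_onsets):
        for i, r_on in enumerate(ref):
            if not used[i] and abs(t - r_on) <= tol:
                used[i] = True
                hits += 1
                break
    p = hits / len(est_onsets) if est_onsets else 0.0
    r = hits / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return 100 * p, 100 * r, 100 * f1
```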
Wu, W.; Lee, J. Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations. Appl. Sci. 2024, 14, 8532. https://doi.org/10.3390/app14188532