Korean Pansori Vocal Note Transcription Using Attention-Based Segmentation and Viterbi Decoding
Abstract
1. Introduction
2. Related Work
3. Method
3.1. Frame-Level Transcription
3.1.1. Input
3.1.2. Attention-Based Semantic Segmentation Framework
3.1.3. Loss Function
3.2. Note-Level Transcription
4. Experimental Results
4.1. Frame-Level Transcription
4.2. Note-Level Transcription
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kang, B. UNLV Theses, Dissertations, Professional Papers, and Capstones; UNLV Theses: Las Vegas, NV, USA, 2016; p. 2789. [Google Scholar] [CrossRef]
- Um, H. Performing Pansori music drama: Stage, story and sound. Rediscovering Tradit. Korean Perform. Arts 2012, 72. [Google Scholar]
- Bhattarai, B.; Pandeya, Y.R.; Lee, J. Parallel stacked hourglass network for music source separation. IEEE Access 2020, 8, 206016–206027. [Google Scholar] [CrossRef]
- Jouvet, D.; Laprie, Y. Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In Proceedings of the 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; IEEE: Piscataway, NJ, USA; pp. 1614–1618. [Google Scholar]
- Babacan, O.; Drugman, T.; d’Alessandro, N.; Henrich, N.; Dutoit, T. A comparative study of pitch extraction algorithms on a large variety of singing sounds. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA; pp. 7815–7819. [Google Scholar]
- Von Dem Knesebeck, A.; Zölzer, U. Comparison of pitch trackers for real-time guitar effects. In Proceedings of the 13th International Conference on Digital Audio Effects, Graz, Austria, 6–10 September 2010. [Google Scholar]
- Duan, Z.; Temperley, D. Note-level Music Transcription by Maximum Likelihood Sampling. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014; pp. 181–186. [Google Scholar]
- Kim, S.; Hayashi, T.; Toda, T. Note-level automatic guitar transcription using attention mechanism. In Proceedings of the 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; IEEE: Piscataway, NJ, USA; pp. 229–233. [Google Scholar]
- Hsu, J.Y.; Su, L. VOCANO: A note transcription framework for singing voice in polyphonic music. In Proceedings of the 22nd International Society for Music Information Retrieval Conference, Online, 7–12 November 2021. [Google Scholar]
- Su, L.; Yang, Y.H. Combining spectral and temporal representations for multipitch estimation of polyphonic music. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1600–1612. [Google Scholar] [CrossRef]
- Nam, J.; Ngiam, J.; Lee, H.; Slaney, M. A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations. In Proceedings of the 12th International Society for Music Information Retrieval Conference, Miami, FL, USA, 24–28 October 2011; pp. 175–180. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269. [Google Scholar] [CrossRef]
- Emiya, V.; Badeau, R.; David, B. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 1643–1654. [Google Scholar] [CrossRef]
- Sigtia, S.; Benetos, E.; Dixon, S. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 927–939. [Google Scholar] [CrossRef]
- Kelz, R.; Dorfer, M.; Korzeniowski, F.; Böck, S.; Arzt, A.; Widmer, G. On the potential of simple framewise approaches to piano transcription. arXiv 2016, arXiv:1612.05153. [Google Scholar]
- Gao, Y.; Zhang, X.; Li, W. Vocal melody extraction via hrnet-based singing voice separation and encoder-decoder-based f0 estimation. Electronics 2021, 10, 298. [Google Scholar] [CrossRef]
- Kong, Q.; Li, B.; Song, X.; Wan, Y.; Wang, Y. High-resolution piano transcription with pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3707–3717. [Google Scholar] [CrossRef]
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR), 2019; arXiv:1810.12247. Available online: https://arxiv.org/abs/1810.12247 (accessed on 25 December 2023).
- Wu, Y.T.; Luo, Y.J.; Chen, T.P.; Wei, I.; Hsu, J.Y.; Chuang, Y.C.; Su, L. Omnizart: A general toolbox for automatic music transcription. arXiv 2021, arXiv:2106.00497. [Google Scholar] [CrossRef]
- Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-task multitrack music transcription. In Proceedings of the International Conference on Learning Representations (ICLR), 2022; arXiv:2111.03017. [Google Scholar]
- Berg-Kirkpatrick, T.; Andreas, J.; Klein, D. Unsupervised transcription of piano music. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
- Ewert, S.; Plumbley, M.D.; Sandler, M. A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA; pp. 569–573. [Google Scholar]
- Kameoka, H.; Nishimoto, T.; Sagayama, S. A multipitch analyzer based on harmonic temporal structured clustering. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 982–994. [Google Scholar] [CrossRef]
- Boulanger-Lewandowski, N.; Bengio, Y.; Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv 2012, arXiv:1206.6392. [Google Scholar]
- Hsieh, T.H.; Su, L.; Yang, Y.H. A streamlined encoder/decoder architecture for melody extraction. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA; pp. 156–160. [Google Scholar]
- Lu, W.T.; Su, L. Vocal Melody Extraction with Semantic Segmentation and Audio-symbolic Domain Transfer Learning. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 521–528. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Bittner, R.M.; Salamon, J.; Tierney, M.; Mauch, M.; Cannam, C.; Bello, J.P. Medleydb: A multitrack dataset for annotation-intensive mir research. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014; pp. 155–160. [Google Scholar]
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; Volume 8, pp. 18–25. [Google Scholar]
- Bittner, R.M.; McFee, B.; Salamon, J.; Li, P.; Bello, J.P. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the International Society for Music Information Retrieval Conference, Suzhou, China, 23–28 October 2017; pp. 63–70. [Google Scholar]
- Mauch, M.; Dixon, S. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the 2014 IEEE International Conference On Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; IEEE: Piscataway, NJ, USA; pp. 659–663. [Google Scholar]
- Morise, M.; Yokomori, F.; Ozawa, K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 2016, 99, 1877–1884. [Google Scholar] [CrossRef]
- Audacity Software. 1999. Available online: http://thurs3.pbworks.com/f/audacity.pdf (accessed on 25 December 2023).
Abbreviations: SA, self-attention; F, focal loss; CE, cross-entropy loss; VR, voicing recall; VFA, voicing false alarm; RPA, raw pitch accuracy; RCA, raw chroma accuracy; OA, overall accuracy. All values are percentages.

Method | VR | VFA | RPA | RCA | OA |
---|---|---|---|---|---|
One-layer SA + F | 68.05 | 10.53 | 59.91 | 61.38 | 78.74 |
Two-layer SA + F | 64.68 | 7.75 | 57.69 | 58.97 | 79.83 |
Three-layer SA + F | 59.72 | 8.94 | 52.73 | 54.60 | 76.86 |
One-layer SA + CE | 60.75 | 7.74 | 54.75 | 55.75 | 78.25 |
Two-layer SA + CE | 63.68 | 7.55 | 58.30 | 59.12 | 80.25 |
Three-layer SA + CE | 67.08 | 9.62 | 59.22 | 60.43 | 78.90 |
Channel attention + F | 65.84 | 9.02 | 60.56 | 61.71 | 80.33 |
Channel attention + CE | 65.73 | 8.77 | 59.40 | 60.63 | 79.87 |
DSM [35] | 88.4 | 48.7 | 72.0 | 74.8 | 66.2 |
Lu and Su’s [31] | 77.9 | 22.4 | 68.3 | 70.0 | 70.0 |
SegNet [30] | 73.7 | 13.3 | 65.5 | 68.9 | 79.7 |
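The five scores above are the standard frame-level melody-evaluation metrics. As a minimal sketch of how they can be computed, assuming the mir_eval toolkit (an assumption; the excerpt does not name the evaluation implementation), time-aligned reference and estimated F0 contours with unvoiced frames encoded as 0 Hz yield all five values at once; the contours and hop size below are purely illustrative.

```python
import numpy as np
import mir_eval

# Hypothetical reference and estimated F0 contours sampled every 10 ms;
# unvoiced frames are encoded as 0 Hz, as mir_eval.melody expects.
hop = 0.01
ref_freq = np.array([220.0, 0.0, 233.1, 246.9, 0.0])
est_freq = np.array([219.0, 0.0, 233.1, 0.0, 261.6])
times = np.arange(len(ref_freq)) * hop

scores = mir_eval.melody.evaluate(times, ref_freq, times, est_freq)
for key in ("Voicing Recall", "Voicing False Alarm",
            "Raw Pitch Accuracy", "Raw Chroma Accuracy", "Overall Accuracy"):
    print(f"{key}: {100 * scores[key]:.2f}%")
```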
PyWorld

Audio | Expert1 | Expert2 | Expert3 | Expert4 | Expert5 | Average |
---|---|---|---|---|---|---|
chunk0 | 3 | 3 | 4 | 3 | 2 | 3 |
chunk1 | 2 | 3 | 2 | 3 | 2 | 2.4 |
chunk4 | 1 | 2 | 2 | 1 | 1 | 1.4 |
chunk5 | 2 | 1 | 2 | 2 | 1 | 1.6 |
chunk6 | 2 | 2 | 2 | 3 | 2 | 2.2 |
chunk7 | 3 | 2 | 3 | 3 | 2 | 2.6 |
chunk8 | 1 | 2 | 4 | 3 | 1 | 2.2 |
chunk9 | 3 | 2 | 3 | 2 | 2 | 2.4 |
chunk11 | 2 | 2 | 2 | 1 | 1 | 1.6 |
chunk12 | 2 | 2 | 2 | 1 | 2 | 1.8 |
chunk13 | 2 | 2 | 3 | 1 | 1 | 1.8 |
chunk14 | 3 | 2 | 1 | 2 | 1 | 1.8 |
chunk16 | 3 | 2 | 3 | 2 | 1 | 2.2 |
chunk17 | 3 | 2 | 2 | 2 | 2 | 2.2 |
chunk18 | 1 | 2 | 2 | 1 | 1 | 1.4 |
Overall average | | | | | | 2.04 |
pYIN

Audio | Expert1 | Expert2 | Expert3 | Expert4 | Expert5 | Average |
---|---|---|---|---|---|---|
chunk0 | 3 | 3 | 4 | 3 | 2 | 3 |
chunk1 | 4 | 3 | 2 | 3 | 1 | 2.6 |
chunk4 | 5 | 4 | 3 | 3 | 2 | 3.4 |
chunk5 | 2 | 2 | 2 | 1 | 2 | 1.8 |
chunk6 | 3 | 3 | 2 | 2 | 3 | 2.6 |
chunk7 | 3 | 3 | 2 | 2 | 1 | 2.2 |
chunk8 | 2 | 2 | 4 | 2 | 3 | 2.6 |
chunk9 | 3 | 3 | 2 | 3 | 2 | 2.6 |
chunk11 | 3 | 5 | 2 | 3 | 1 | 2.8 |
chunk12 | 4 | 4 | 2 | 3 | 1 | 2.8 |
chunk13 | 2 | 2 | 4 | 3 | 2 | 2.6 |
chunk14 | 3 | 2 | 1 | 2 | 1 | 1.8 |
chunk16 | 4 | 5 | 3 | 3 | 1 | 3.2 |
chunk17 | 3 | 2 | 3 | 3 | 1 | 2.4 |
chunk18 | 2 | 2 | 2 | 3 | 1 | 2 |
Overall average | | | | | | 2.56 |
Channel Attention + F (Ours)

Audio | Expert1 | Expert2 | Expert3 | Expert4 | Expert5 | Average |
---|---|---|---|---|---|---|
chunk0 | 4 | 5 | 3 | 2 | 3 | 3.4 |
chunk1 | 2 | 3 | 2 | 3 | 2 | 2.4 |
chunk4 | 3 | 4 | 2 | 2 | 2 | 2.6 |
chunk5 | 2 | 3 | 4 | 3 | 3 | 3 |
chunk6 | 3 | 4 | 2 | 2 | 3 | 2.8 |
chunk7 | 3 | 3 | 2 | 1 | 2 | 2.2 |
chunk8 | 2 | 2 | 3 | 1 | 2 | 2 |
chunk9 | 2 | 2 | 2 | 1 | 3 | 2 |
chunk11 | 2 | 2 | 2 | 2 | 2 | 2 |
chunk12 | 1 | 2 | 1 | 1 | 1 | 1.2 |
chunk13 | 3 | 4 | 3 | 3 | 2 | 3 |
chunk14 | 3 | 4 | 2 | 3 | 3 | 3 |
chunk16 | 1 | 1 | 2 | 2 | 1 | 1.4 |
chunk17 | 2 | 2 | 2 | 2 | 1 | 1.8 |
chunk18 | 1 | 3 | 2 | 2 | 2 | 2 |
Overall average | | | | | | 2.32 |
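For context on the two baseline trackers rated in the tables above, the sketch below shows one common way to obtain their F0 contours in Python, using the pyworld wrapper of the WORLD vocoder and the pYIN implementation provided by librosa (WORLD, pYIN, and librosa are all cited in the references). The file name, sampling rate, and frequency range are illustrative assumptions, not the authors' exact settings.

```python
import librosa
import numpy as np
import pyworld as pw

# Hypothetical input clip; 16 kHz mono is an assumption, not the paper's setting.
y, sr = librosa.load("pansori_chunk.wav", sr=16000)

# PyWorld (WORLD vocoder): DIO gives a coarse F0 track, StoneMask refines it.
x = y.astype(np.float64)            # pyworld expects float64 samples
f0_world, t_world = pw.dio(x, sr)
f0_world = pw.stonemask(x, f0_world, t_world, sr)

# pYIN as implemented in librosa; unvoiced frames are returned as NaN.
f0_pyin, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0_pyin = np.nan_to_num(f0_pyin)    # represent unvoiced frames as 0 Hz
```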
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).