Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks
Abstract
Featured Application
1. Introduction
2. Related Work
2.1. RNN-Based Singing Voice Separation
2.2. Loss Function of Singing Voice Separation Models
3. Proposed Curriculum Learning Technique on RNN and U-Net
3.1. Stacked RNN-Based Separation Model
Algorithm 1 Stacked RNN Learning
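To make the idea concrete, below is a minimal PyTorch-style sketch of a stacked-RNN separator with joint soft time-frequency masking, in the spirit of the Huang et al. models cited in Section 2.1. The GRU cell, hidden size, and layer count are illustrative assumptions, not the configuration reported in this paper.

```python
import torch
import torch.nn as nn

class StackedRNNSeparator(nn.Module):
    """Stacked RNN that jointly estimates two sources via a soft mask."""

    def __init__(self, n_bins=513, hidden=256, n_layers=3):
        super().__init__()
        # Stacked recurrent layers over magnitude-spectrogram frames.
        self.rnn = nn.GRU(n_bins, hidden, num_layers=n_layers, batch_first=True)
        # One dense head per source (vocal and accompaniment).
        self.fc_vocal = nn.Linear(hidden, n_bins)
        self.fc_accomp = nn.Linear(hidden, n_bins)

    def forward(self, mix_mag):
        # mix_mag: (batch, frames, n_bins) mixture magnitude spectrogram.
        h, _ = self.rnn(mix_mag)
        y_vocal = torch.relu(self.fc_vocal(h))
        y_accomp = torch.relu(self.fc_accomp(h))
        # Soft time-frequency mask: the two estimates compete per bin,
        # so the masked outputs always sum back to the mixture.
        mask = y_vocal / (y_vocal + y_accomp + 1e-8)
        return mask * mix_mag, (1.0 - mask) * mix_mag
```

Training would then minimize, for example, the mean squared error between the masked magnitude estimates and the clean source spectrograms; the separated waveforms are recovered by applying the mixture phase and an inverse STFT.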
3.2. U-Net Baseline Model
Algorithm 2 U-Net Learning
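For the U-Net baseline, a compact sketch of a spectrogram-masking encoder-decoder in the same PyTorch style follows. The depth, channel widths, and kernel sizes here are illustrative assumptions; the paper's own layer table lists the actual configuration.

```python
import torch
import torch.nn as nn

def down(c_in, c_out):
    # Strided convolution halves the time-frequency resolution per level.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def up(c_in, c_out):
    # Transposed convolution doubles the resolution back.
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=5, stride=2,
                           padding=2, output_padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU())

class UNetSeparator(nn.Module):
    """Three-level U-Net that predicts a soft mask over the spectrogram."""

    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = down(1, 16), down(16, 32), down(32, 64)
        self.dec3 = up(64, 32)
        self.dec2 = up(64, 16)   # input: dec3 output (32) + enc2 skip (32)
        self.dec1 = nn.ConvTranspose2d(32, 1, kernel_size=5, stride=2,
                                       padding=2, output_padding=1)

    def forward(self, mix_mag):
        # mix_mag: (batch, 1, freq, frames); both spatial dims must be
        # divisible by 8 (e.g., crop a 513-bin STFT to 512 bins).
        e1 = self.enc1(mix_mag)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))  # skip connection
        mask = torch.sigmoid(self.dec1(torch.cat([d2, e1], dim=1)))
        return mask * mix_mag  # estimated source spectrogram
```

The skip connections concatenate encoder features with decoder features at the same resolution, which is what lets the network recover fine spectral detail lost to downsampling.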
3.3. Proposed Curriculum Learning for Singing Voice Separation
Algorithm 3 Curriculum Learning
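The essence of the curriculum is the training schedule rather than the model. Below is a hypothetical sketch of such a schedule, assuming each training clip has a precomputed difficulty score; the paper's actual ordering criterion is the one specified in Algorithm 3.

```python
import numpy as np

def curriculum_train(model, clips, difficulty, train_epoch,
                     n_stages=4, epochs_per_stage=25):
    """Train on progressively harder subsets, easiest clips first.

    clips: list of training examples (e.g., spectrogram pairs).
    difficulty: one score per clip, lower = easier; e.g., the
        accompaniment-to-vocal energy ratio could serve as a proxy.
    train_epoch: callable running one ordinary training epoch on a subset.
    """
    order = np.argsort(difficulty)  # indices sorted easy -> hard
    for stage in range(1, n_stages + 1):
        # Each stage admits the easiest stage/n_stages fraction of the
        # data, so the final stage trains on the full set.
        cutoff = int(len(clips) * stage / n_stages)
        subset = [clips[i] for i in order[:cutoff]]
        for _ in range(epochs_per_stage):
            train_epoch(model, subset)
    return model
```

Because every later stage still contains the earlier (easier) clips, the model keeps rehearsing what it has already learned while new, harder material is introduced.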
4. Evaluation
4.1. Separation Model Configuration
4.2. Memory and Computational Complexity Analysis
4.3. Experimental Results of MIR-1K Dataset
4.4. Experimental Results of ccMixter Dataset
4.5. Experimental Results of MUSDB18 Dataset
5. Discussion
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| STFT | Short-Time Fourier Transform |
| CNN | Convolutional Neural Network |
| RNN | Recurrent Neural Network |
| CRNN | Convolutional Recurrent Neural Network |
| GAN | Generative Adversarial Network |
| GNSDR | Global Normalized Source-to-Distortion Ratio |
| GSIR | Global Source-to-Interference Ratio |
| GSAR | Global Source-to-Artifact Ratio |
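All three global measures are length-weighted averages of per-clip BSS Eval scores, where NSDR is the SDR of the estimate minus the SDR of the unprocessed mixture against the same reference (Hsu and Jang; Vincent et al.). A sketch using the mir_eval implementation of BSS Eval; the choice of mir_eval is an assumption, and any BSS Eval toolbox yields the same quantities.

```python
import numpy as np
import mir_eval

def global_metrics(vocal_refs, accomp_refs, vocal_ests, accomp_ests, mixtures):
    """GNSDR, GSIR, GSAR over a test set, weighted by clip length.

    Each argument is a list of 1-D waveforms, one entry per test clip.
    """
    nsdrs, sirs, sars, weights = [], [], [], []
    for v_ref, a_ref, v_est, a_est, mix in zip(
            vocal_refs, accomp_refs, vocal_ests, accomp_ests, mixtures):
        refs = np.stack([v_ref, a_ref])
        ests = np.stack([v_est, a_est])
        # BSS Eval of the separated sources; index 0 is the vocal.
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(refs, ests)
        # SDR of the raw mixture against the vocal reference.
        sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
            v_ref[np.newaxis, :], mix[np.newaxis, :])
        nsdrs.append(sdr[0] - sdr_mix[0])  # NSDR: gain over the mixture
        sirs.append(sir[0])
        sars.append(sar[0])
        weights.append(len(v_ref))         # weight by clip length
    w = np.asarray(weights, dtype=float)
    return (float(np.average(nsdrs, weights=w)),   # GNSDR
            float(np.average(sirs, weights=w)),    # GSIR
            float(np.average(sars, weights=w)))    # GSAR
```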
References
- Lin, K.W.E.; Balamurali, B.T.; Koh, E.; Lui, S.; Herremans, D. Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy. Neural Comput. Appl. 2020, 32, 1037–1050.
- Takahashi, N.; Mitsufuji, Y. Multi-scale multi-band DenseNets for audio source separation. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 15–18 October 2017; pp. 21–25.
- Chandna, P.; Miron, M.; Janer, J.; Gómez, E. Monoaural audio source separation using deep convolutional neural networks. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Grenoble, France, 21–23 February 2017; pp. 258–266.
- Huang, P.S.; Kim, M.; Hasegawa-Johnson, M.; Smaragdis, P. Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 2136–2147.
- Huang, P.S.; Kim, M.; Hasegawa-Johnson, M.; Smaragdis, P. Singing-Voice Separation from Monaural Recordings Using Deep Recurrent Neural Networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 27–31 October 2014; pp. 477–482.
- Huang, P.S.; Kim, M.; Hasegawa-Johnson, M.; Smaragdis, P. Deep learning for monaural speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1562–1566.
- Jansson, A.; Humphrey, E.; Montecchio, N.; Bittner, R.; Kumar, A.; Weyde, T. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval (ISMIR) Conference, Suzhou, China, 23–27 October 2017; pp. 745–751.
- Oh, J.; Kim, D.; Yun, S. Spectrogram-Channels U-Net: A Source Separation Model Viewing Each Channel as the Spectrogram of Each Source. arXiv 2018, arXiv:1810.11520.
- Lluís, F.; Pons, J.; Serra, X. End-to-end music source separation: Is it possible in the waveform domain? arXiv 2018, arXiv:1810.12187.
- Stoller, D.; Ewert, S.; Dixon, S. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. arXiv 2018, arXiv:1806.03185.
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the International Conference on Machine Learning (ICML), Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
- Gong, C.; Tao, D.; Maybank, S.J.; Liu, W.; Kang, G.; Yang, J. Multi-Modal Curriculum Learning for Semi-Supervised Image Classification. IEEE Trans. Image Process. 2016, 25, 3249–3260.
- Li, S.; Zhu, X.; Huang, Q.; Xu, H.; Kuo, C.J. Multiple Instance Curriculum Learning for Weakly Supervised Object Detection. arXiv 2017, arXiv:1711.09191.
- Wang, J.; Wang, X.; Liu, W. Weakly- and Semi-Supervised Faster R-CNN with Curriculum Learning. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2416–2421.
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470.
- Graves, A.; Bellemare, M.G.; Menick, J.; Munos, R.; Kavukcuoglu, K. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1311–1320.
- Takahashi, N.; Goswami, N.; Mitsufuji, Y. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018; pp. 106–110.
- Stöter, F.R.; Uhlich, S.; Liutkus, A.; Mitsufuji, Y. Open-Unmix: A Reference Implementation for Music Source Separation. J. Open Source Softw. 2019, 4, 1667.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
- Chan, A.H.S.; Ao, S.I. Advances in Industrial Engineering and Operations Research; Springer: Berlin/Heidelberg, Germany, 2008.
- Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis. Neural Comput. 2009, 21, 793–830.
- Lefèvre, A.; Bach, F.; Févotte, C. Itakura-Saito nonnegative matrix factorization with group sparsity. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 21–24.
- Févotte, C.; Idier, J. Algorithms for Nonnegative Matrix Factorization with the β-Divergence. Neural Comput. 2011, 23, 2421–2456.
- Lu, Z.; Yang, Z.; Oja, E. Selecting β-Divergence for Nonnegative Matrix Factorization by Score Matching. In Proceedings of the International Conference on Artificial Neural Networks and Machine Learning (ICANN), Lausanne, Switzerland, 11–14 September 2012; pp. 419–426.
- Vincent, E.; Gribonval, R.; Févotte, C. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469.
- Lotfian, R.; Busso, C. Curriculum learning for speech emotion recognition from crowdsourced labels. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 815–826.
- Luo, Y.; Chen, Z.; Mesgarani, N. Speaker-independent speech separation with deep attractor network. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 787–796.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015.
- Hsu, C.L.; Jang, J.S.R. On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 310–319.
- Liutkus, A.; Fitzgerald, D.; Rafii, Z. Scalable audio separation with light kernel additive modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015.
- Rafii, Z.; Liutkus, A.; Stöter, F.R.; Mimilakis, S.I.; Bittner, R. The MUSDB18 corpus for music separation. Zenodo 2017.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; Volume 37, pp. 448–456.
- Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; Volume 30.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
[Table: U-Net layer configuration. Columns: Layer Number (0–6); Conv: Output, Size; Upconv: Input, Output, Size. Layer 6 has no Upconv entries; the numeric cell values were not recoverable.]
Table: separation performance in dB (Section 4.3, MIR-1K). Each source group (Instrument, Vocal) has three metric columns, corresponding to GSIR, GSAR, and GNSDR per the Abbreviations; the original per-column labels were not preserved. Arrows mark improvements of the proposed curriculum learning over the baseline.

| Model | Instrument | | | Vocal | | |
|---|---|---|---|---|---|---|
| RNN, baseline | 11.72 | 7.02 | 5.23 | 12.19 | 6.79 | 5.03 |
| RNN, proposed | ↑ 12.30 | 7.08 | 5.43 | 11.75 | 7.25 | 5.30 |
| U-Net, baseline | 10.67 | 8.31 | 5.91 | 12.72 | 7.38 | 5.84 |
| U-Net, proposed | 11.10 | 8.25 | 6.03 | 12.41 | 7.68 | 5.99 |
Table: separation performance in dB (Section 4.4, ccMixter); same column layout as the previous table.

| Model | Instrument | | | Vocal | | |
|---|---|---|---|---|---|---|
| RNN, baseline | 10.75 | 10.38 | 3.26 | 11.09 | 3.18 | 5.60 |
| RNN, proposed | 10.41 | ↑ 11.30 | 3.53 | ⇑ 12.35 | ↑ 3.73 | ↑ 6.34 |
| U-Net, baseline | 9.00 | 12.98 | 3.36 | 14.55 | 3.38 | 6.27 |
| U-Net, proposed | 9.25 | 12.71 | 3.50 | 14.50 | 3.85 | 6.68 |
Table: separation performance in dB (Section 4.5, MUSDB18); same column layout as the previous tables.

| Model | Instrument | | | Vocal | | |
|---|---|---|---|---|---|---|
| RNN, baseline | 15.41 | 11.48 | 3.15 | 10.34 | 3.91 | 8.55 |
| RNN, proposed | ⇑ 17.10 | ↑ 12.26 | ⇑ 4.21 | 10.56 | 4.27 | 8.98 |
| U-Net, baseline | 13.42 | 15.06 | 4.10 | 10.97 | 3.96 | 8.84 |
| U-Net, proposed | ↑ 14.03 | 14.97 | 4.48 | 10.61 | ↑ 4.73 | ↑ 9.48 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Kang, S.; Park, J.-S.; Jang, G.-J. Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks. Appl. Sci. 2020, 10, 2465. https://doi.org/10.3390/app10072465