Wav2wav: Wave-to-Wave Voice Conversion
Abstract
1. Introduction
- We propose a novel wave-to-wave voice conversion architecture that jointly trains the analysis, mapping, and reconstruction modules for high-quality voice conversion (a schematic sketch of such an integrated model follows this list).
- We provide an efficient training algorithm so that the proposed GAN-based integrated model can be trained reliably with very small amounts of training data, such as the VCC2018 dataset.
- The supervised learning process of standalone vocoders is modified to accommodate unsupervised learning in the end-to-end training of the integrated model.
- We demonstrate the usefulness of the proposed method using both objective and subjective measures.
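To make the integrated design concrete, the following is a minimal, self-contained PyTorch sketch of an analysis–mapping–reconstruction chain. All module names, layer sizes, strides, and the 256-sample frame hop are illustrative assumptions for this sketch, not the architecture reported in the paper; it only shows how a learned CNN front-end, a feature-mapping network, and a HiFi-GAN-style upsampling generator can be composed so that gradients flow end to end from waveform to waveform.

```python
# Illustrative sketch only: module names, layer sizes, and strides are
# assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn


class Analysis(nn.Module):
    """Learned CNN front-end replacing a DFT-based feature extractor."""
    def __init__(self, channels=80):
        super().__init__()
        # Strided 1-D convolutions downsample the waveform to a frame rate
        # comparable to a mel-spectrogram (total stride 4 * 4 * 16 = 256).
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=4, padding=6),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel_size=16, stride=4, padding=6),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel_size=16, stride=16),
        )

    def forward(self, wav):                  # wav: (B, 1, T)
        return self.net(wav)                 # (B, channels, T // 256)


class Mapping(nn.Module):
    """Source-to-target feature converter (the CycleGAN-style mapping part)."""
    def __init__(self, channels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(256, channels, kernel_size=5, padding=2),
        )

    def forward(self, feat):
        return self.net(feat)


class Reconstruction(nn.Module):
    """HiFi-GAN-style upsampling generator turning features back into audio."""
    def __init__(self, channels=80, upsample=(8, 8, 4)):
        super().__init__()
        layers, ch = [nn.Conv1d(channels, 256, kernel_size=7, padding=3)], 256
        for r in upsample:
            layers += [
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r, stride=r, padding=r // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.1), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, feat):
        return self.net(feat)


class Wav2Wav(nn.Module):
    """Analysis -> mapping -> reconstruction, trained jointly: waveform in, waveform out."""
    def __init__(self):
        super().__init__()
        self.analysis = Analysis()
        self.mapping = Mapping()
        self.reconstruction = Reconstruction()

    def forward(self, wav):
        return self.reconstruction(self.mapping(self.analysis(wav)))


if __name__ == "__main__":
    model = Wav2Wav()
    src = torch.randn(1, 1, 16384)           # one batch of raw source speech
    print(model(src).shape)                   # converted waveform, same length
```

Because the three modules form a single computation graph, reconstruction and adversarial losses can update the analysis front-end as well, which is the point of joint training in the proposed architecture.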
2. Related Works
2.1. CycleGAN-Based Voice Conversion
2.2. HiFi-GAN Vocoder
3. Proposed Method
Algorithm 1 (wav2wav training). Initialization: load pretrained parameters. Repeat until convergence: randomly select a sample from the training data; in Phase 1, compute the Phase 1 losses and update the corresponding parameters; in Phase 2, compute the Phase 2 losses and update the corresponding parameters.
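The extracted listing loses the original symbols, so the following is a hedged sketch of the two-phase alternating update it outlines, assuming Phase 1 updates the discriminator and Phase 2 updates the integrated generator (the usual GAN ordering); the actual losses, modules, and phase contents are those of the paper's Algorithm 1 and may differ. The stand-in networks, checkpoint file name, and batch sampler below are purely illustrative placeholders.

```python
# Hedged sketch of a two-phase alternating GAN update, not the paper's exact
# Algorithm 1: Phase 1 is assumed to be the discriminator update and Phase 2
# the generator update.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in networks; the real generator is the integrated
# analysis-mapping-reconstruction model with HiFi-GAN-style discriminators.
generator = nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.LeakyReLU(0.1),
                          nn.Conv1d(16, 1, 15, padding=7), nn.Tanh())
discriminator = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.1),
                              nn.Conv1d(16, 1, 15, stride=4, padding=7))

# "Initialization: load pretrained parameters" -- e.g. a pretrained vocoder
# checkpoint; the file name is hypothetical.
# generator.load_state_dict(torch.load("pretrained_vocoder.pt"), strict=False)

opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)


def sample_batch():
    """Placeholder for randomly selecting (source, target) waveforms from the training data."""
    return torch.randn(4, 1, 4096), torch.randn(4, 1, 4096)


for step in range(100):                       # the paper repeats until convergence
    src, tgt = sample_batch()

    # ----- Phase 1: update the discriminator (assumed) -----
    with torch.no_grad():
        fake = generator(src)
    d_real, d_fake = discriminator(tgt), discriminator(fake)
    d_loss = F.mse_loss(d_real, torch.ones_like(d_real)) + \
             F.mse_loss(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # ----- Phase 2: update the integrated generator (assumed) -----
    fake = generator(src)
    d_fake = discriminator(fake)
    g_loss = F.mse_loss(d_fake, torch.ones_like(d_fake))
    # A cycle-consistency term in the style of CycleGAN-based VC would also be
    # added here; its exact form is not recoverable from the extracted text.
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```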
4. Experiments
4.1. Objective Evaluation
4.2. Subjective Evaluation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Nose, T.; Igarashi, Y. Real-time talking avatar on the internet using kinect and voice conversion. Int. J. Adv. Comput. Sci. Appl. 2015, 6, 301–307.
2. Felps, D.; Bortfeld, H.; Gutierrez-Osuna, R. Foreign accent conversion in computer assisted pronunciation training. Speech Commun. 2009, 51, 920–932.
3. Zhao, Y.; Kuruvilla-Dugdale, M.; Song, M. Voice conversion for persons with amyotrophic lateral sclerosis. IEEE J. Biomed. Health Inform. 2019, 24, 2942–2949.
4. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
6. Tobing, P.L.; Wu, Y.-C.; Hayashi, T.; Kobayashi, K.; Toda, T. Non-parallel voice conversion with cyclic variational autoencoder. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 659–663.
7. Yook, D.; Leem, S.-G.; Lee, K.; Yoo, I.-C. Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders. In Proceedings of the Odyssey: The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 215–221.
8. Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the European Signal Processing Conference, Rome, Italy, 3–7 September 2018; pp. 2100–2104.
9. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6820–6824.
10. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2017–2021.
11. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. MaskCycleGAN-VC: Learning non-parallel voice conversion with filling in frames. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 5919–5923.
12. Proszewska, M.; Beringer, G.; Sáez-Trigueros, D.; Merritt, T.; Ezzerg, A.; Barra-Chicote, R. GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 2973–2977.
13. Popov, V.; Vovk, I.; Gogoryan, V. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
14. Sisman, B.; Yamagishi, J.; King, S.; Li, H. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 132–157.
15. Sainath, T.N.; Weiss, R.J.; Senior, A.; Wilson, K.W.; Vinyals, O. Learning the speech front-end with raw waveform CLDNNs. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 1–5.
16. Sailor, H.B.; Patil, H.A. Filterbank learning using convolutional restricted Boltzmann machine for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5895–5899.
17. Kobayashi, K.; Toda, T.; Nakamura, S. F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential. In Proceedings of the IEEE Spoken Language Technology Workshop, San Diego, CA, USA, 13–16 December 2016; pp. 700–963.
18. Kurita, Y.; Kobayashi, K.; Takeda, K.; Toda, T. Robustness of statistical voice conversion based on direct waveform modification against background sounds. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 684–688.
19. Kobayashi, K.; Toda, T. Sprocket: Open-source voice conversion software. In Proceedings of the Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018; pp. 203–210.
20. Kim, J.-W.; Jung, H.-Y.; Lee, M. Vocoder-free end-to-end voice conversion with transformer network. In Proceedings of the International Joint Conference on Neural Networks, Glasgow, UK, 19–24 July 2020; pp. 1–8.
21. Nguyen, B.; Cardinaux, F. NVC-Net: End-to-end adversarial voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore, 22–27 May 2022; pp. 7012–7016.
22. Jeong, C. Voice Conversion Using Generative Adversarial Network Based Vocoder. Master’s Thesis, Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea, 2023.
23. Lorenzo-Trueba, J.; Yamagishi, J.; Toda, T.; Saito, D.; Villavicencio, F.; Kinnunen, T.; Ling, Z. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. In Proceedings of the Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, 26–29 June 2018; pp. 195–202.
24. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
25. Griffin, D.W.; Lim, J.S. Multiband excitation vocoder. IEEE Trans. Acoust. Speech Signal Process. 1988, 36, 1223–1235.
26. Morise, M.; Yokomori, F.; Ozawa, K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 2016, E99.D, 1877–1884.
27. Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. In Proceedings of the ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016; p. 125.
28. Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A.; Dieleman, S.; Kavukcuoglu, K. Efficient neural audio synthesis. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2410–2419.
29. Jin, Z.; Finkelstein, A.; Mysore, G.J.; Lu, J. FFTNet: A real-time speaker-dependent neural vocoder. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 2251–2255.
30. Valin, J.-M.; Skoglund, J. LPCNet: Improving neural speech synthesis through linear prediction. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 5891–5895.
31. Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; Driessche, G.; Lockhart, E.; Cobo, L.C.; Stimberg, F.; et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3918–3926.
32. Prenger, R.; Valle, R.; Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 3617–3621.
33. Kumar, K.; Kumar, R.; Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; Brebisson, A.; Bengio, Y.; Courville, A. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14910–14921.
34. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis. In Proceedings of the Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17022–17033.
35. Zhou, T.; Krahenbuhl, P.; Aubry, M.; Huang, Q.; Efros, A.A. Learning dense correspondence via 3D-guided cycle consistency. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 117–126.
36. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised cross-domain image generation. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
37. Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 125–128.
Authors | Year | Refs. | Dataset | Method | Model |
---|---|---|---|---|---|
P. L. Tobing et al. | 2019 | [6] | VCC2018 | VAE with cycle loss | RNNT |
D. Yook et al. | 2020 | [7] | VCC2018 | Ref. [6] with multiple decoders | CNN |
T. Kaneko et al. | 2018 | [8] | VCC2016 | GAN with cycle loss | CNN |
T. Kaneko et al. | 2019 | [9] | VCC2018 | Improved [8] by modifying loss and models | CNN |
T. Kaneko et al. | 2020 | [10] | VCC2018 | Improved [9] by applying TFAN norm | CNN |
T. Kaneko et al. | 2021 | [11] | VCC2018 | Improved [9] by masking input feature | CNN |
M. Proszewska et al. | 2022 | [12] | In-house | Flow-based model | LSTM |
K. Kobayashi et al. | 2016 | [17] | VCC2016 | Direct waveform modification | Diff VC |
Y. Kurita et al. | 2019 | [18] | In-house | Applied [17] to a singing voice conversion task | Diff VC |
K. Kobayashi et al. | 2018 | [19] | VCC2018 | Open-source implementation of [17] | Diff VC |
J.-W. Kim et al. | 2020 | [20] | TIDIGITS | Translation-based method using a transformer | Transformer |
B. Nguyen et al. | 2022 | [21] | VCTK | Content and speaker disentanglement | CNN |
Mel-cepstral distortion (MCD) by conversion direction (lower is better):

| | Conversion Direction | MaskCycleGAN | wav2wav |
|---|---|---|---|
| Intra-gender | F1→F2 | 7.68 ± 0.29 | 6.08 ± 0.26 |
| | F2→F1 | 7.44 ± 0.23 | 6.01 ± 0.26 |
| | M1→M2 | 7.96 ± 0.27 | 6.04 ± 0.22 |
| | M2→M1 | 7.04 ± 0.17 | 5.88 ± 0.18 |
| | Average | 7.53 ± 0.13 | 6.01 ± 0.11 |
| Inter-gender | F1→M2 | 8.48 ± 0.21 | 5.86 ± 0.16 |
| | M2→F1 | 8.58 ± 0.21 | 5.75 ± 0.16 |
| | M1→F2 | 8.70 ± 0.28 | 6.47 ± 0.24 |
| | F2→M1 | 8.36 ± 0.23 | 6.38 ± 0.22 |
| | Average | 8.53 ± 0.12 | 6.12 ± 0.11 |
| Overall | Average | 8.03 ± 0.10 | 6.06 ± 0.08 |
Method | Feature Extractor | Vocoder | MCD |
---|---|---|---|
MaskCycleGAN | DFT | MelGAN | 8.03 ± 0.10 |
spec2spec | DFT | HiFi-GAN | 7.88 ± 0.12 |
wav2spec | CNN | HiFi-GAN | 7.52 ± 0.11 |
spec2wav | DFT | N/A | 6.82 ± 0.10 |
wav2wav | CNN | N/A | 6.06 ± 0.08 |
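Both result tables report mel-cepstral distortion (MCD), the objective measure of Kubichek cited in the reference list. Below is a small NumPy sketch of the standard frame-averaged MCD computation between two time-aligned mel-cepstral sequences; the exact recipe behind the reported numbers (coefficient order, exclusion of the energy coefficient, alignment method) is an assumption here, following common practice.

```python
# Hedged sketch of frame-averaged mel-cepstral distortion (MCD).
# Assumes the two sequences are already frame-aligned and that the 0th
# (energy) coefficient is excluded; the paper's exact variant may differ.
import numpy as np


def mcd(ref_mcep: np.ndarray, conv_mcep: np.ndarray) -> float:
    """ref_mcep, conv_mcep: (num_frames, num_coeffs) mel-cepstra, frame-aligned."""
    diff = ref_mcep[:, 1:] - conv_mcep[:, 1:]            # drop c0 (energy term)
    dist_per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * dist_per_frame.mean())


# Example with random stand-in "mel-cepstra": 200 frames, 25 coefficients.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 25))
converted = reference + 0.1 * rng.normal(size=(200, 25))
print(f"MCD = {mcd(reference, converted):.2f} dB")
```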