MixDiff-TTS: Mixture Alignment and Diffusion Model for Text-to-Speech
Abstract
1. Introduction
- We propose MixDiff-TTS, a non-autoregressive TTS model that integrates a mixture alignment mechanism with a diffusion model.
- We introduce a Word-to-Phoneme Attention module with relative position bias to improve the model's ability to handle long text sequences (a minimal sketch follows this list).
- We incorporate a pre-net structure to enhance the model’s learning capability for speech synthesis tasks.
- We incorporate a post-net structure to optimize the reconstruction quality of mel-spectrograms.
- Our objective and subjective evaluations show that MixDiff-TTS outperforms baselines in multiple metrics, validating its effectiveness.
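As a concrete illustration of the second contribution, the following PyTorch sketch shows one plausible form of word-to-phoneme cross-attention with a relative position bias. It is a minimal sketch under our own assumptions (single-head attention, one learned scalar bias per clipped relative distance, the class name `W2PRelPosAttention`); the module in the paper may differ in head count, masking, and normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class W2PRelPosAttention(nn.Module):
    """Cross-attention from word-level queries to phoneme-level keys/values,
    biased by a learned embedding of the clipped relative distance."""

    def __init__(self, dim: int, max_rel_dist: int = 16):
        super().__init__()
        self.scale = dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar bias per clipped relative distance in [-max, max].
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, 1)
        self.max_rel_dist = max_rel_dist

    def forward(self, words: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        # words: (B, Tw, dim); phonemes: (B, Tp, dim)
        q, k, v = self.q_proj(words), self.k_proj(phonemes), self.v_proj(phonemes)
        scores = torch.einsum("bqd,bkd->bqk", q, k) * self.scale

        # Relative distance between each word and phoneme position, clipped so
        # that arbitrarily long sequences reuse the same bias parameters.
        tw, tp = words.size(1), phonemes.size(1)
        rel = torch.arange(tp, device=words.device)[None, :] \
            - torch.arange(tw, device=words.device)[:, None]
        rel = rel.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        scores = scores + self.rel_bias(rel).squeeze(-1)

        attn = F.softmax(scores, dim=-1)
        return torch.einsum("bqk,bkd->bqd", attn, v)
```

Because the bias depends only on the clipped relative distance, the module generalizes to text longer than anything seen during training, which is the property this contribution targets.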
2. Materials and Methods
2.1. Background
2.1.1. Diffusion Models
2.1.2. HiFi-GAN Vocoder
2.2. MixDiff-TTS
2.2.1. Motivation
2.2.2. Linguistic Encoder
2.2.3. Auxiliary Decoder
2.2.4. Denoiser
- An element-wise addition operation that adds the diffusion-step embedding to the hidden sequence.
- A non-causal convolutional layer that transforms the hidden sequence from C to 2C channels (C is typically set to 256).
- A 1 × 1 convolutional layer that maps the conditional features to 2C channels.
- A gating unit that fuses the input information with the conditional information.
- A residual block that splits the fused hidden states into two branches, each with C channels. This structure enables the denoiser to fuse features across hierarchical levels, thereby generating the final prediction (a minimal sketch of one such block is given below).
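The structure itemized above matches the DiffWave-style residual block commonly used in diffusion TTS denoisers. The sketch below is a minimal PyTorch rendering under that assumption; the layer names, the conditioning shape, and the step-embedding handling are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DenoiserResidualBlock(nn.Module):
    def __init__(self, channels: int = 256, dilation: int = 1):
        super().__init__()
        # Non-causal dilated conv: C -> 2C channels, feeding the gating unit.
        self.dilated_conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        # 1x1 conv mapping the conditional features to matching 2C channels.
        self.cond_conv = nn.Conv1d(channels, 2 * channels, kernel_size=1)
        # Output projection, later split into residual and skip branches (C each).
        self.out_conv = nn.Conv1d(channels, 2 * channels, kernel_size=1)
        self.step_proj = nn.Linear(channels, channels)

    def forward(self, x, cond, step_emb):
        # x: (B, C, T); cond: (B, C, T); step_emb: (B, C)
        # Element-wise addition of the diffusion-step embedding.
        h = x + self.step_proj(step_emb).unsqueeze(-1)
        # Fuse input and condition, then apply the tanh/sigmoid gating unit.
        h = self.dilated_conv(h) + self.cond_conv(cond)
        gate, filt = h.chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)
        # Split into a residual branch (input to the next block) and a skip
        # branch (accumulated across blocks for the final prediction).
        residual, skip = self.out_conv(h).chunk(2, dim=1)
        return (x + residual) / (2 ** 0.5), skip
```

Stacking several such blocks and summing their skip outputs is what lets the denoiser fuse features across hierarchical levels before projecting the sum to the final mel-spectrogram prediction.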
2.2.5. Training Loss
3. Results
3.1. Dataset
3.1.1. Dataset Selection
3.1.2. Dataset Processing
3.2. Model Configuration
3.3. Evaluation
3.3.1. Evaluation Metrics
3.3.2. Experimental Results
3.4. Feature Visualization
3.5. Ablation Studies
4. Discussion
- In speech synthesis tasks, multi-speaker models have become a research hotspot because they can generate speech with different genders and timbres without retraining for each specific scenario. However, MixDiff-TTS currently supports only single-speaker synthesis, which limits its applicability in broader scenarios;
- Fully end-to-end TTS models have also become a pivotal research focus in recent years. A fully end-to-end model generates speech waveforms directly from raw text with a unified architecture, without relying on intermediate feature representations such as mel-spectrograms. In contrast, MixDiff-TTS first generates an intermediate mel-spectrogram and then uses a vocoder to convert it into the final waveform, so the acoustic model and the vocoder must be trained separately, incurring additional training cost.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Model | SSIM | MCD | RMSE | FAD | Params | RTF | MOS | CMOS |
---|---|---|---|---|---|---|---|---|
Ground Truth | – | – | – | – | – | – | 4.37 ± 0.08 | – |
FastSpeech2 | 0.496 | 6.751 | 0.329 | 2.1460 | 35.16M | 0.1486 | 3.82 ± 0.09 | 0.000 |
PortaSpeech | 0.505 | 6.683 | 0.323 | 1.8593 | 23.97M | 0.1364 | 3.98 ± 0.07 | 0.189 |
DiffSpeech | 0.501 | 6.735 | 0.335 | 1.5165 | 45.11M | 0.1456 | 3.92 ± 0.08 | 0.179 |
MixDiff-TTS | 0.507 | 6.652 | 0.337 | 1.5013 | 46.31M | 0.1391 | 3.95 ± 0.08 | 0.185 |
Setting | MOS | CMOS |
---|---|---|
MixDiff-TTS | 3.95 ± 0.08 | 0.000 |
MixDiff-TTS w/ phoneme-level hard alignment (in place of mixture alignment) | 3.83 ± 0.07 | −0.247 |
Setting | SSIM | MCD | RMSE | MOS | CMOS |
---|---|---|---|---|---|
MixDiff-TTS | 0.507 | 6.652 | 0.337 | 3.95 ± 0.08 | 0.000 |
MixDiff-TTS w/o W2P-RelPosAttention | 0.503 | 6.679 | 0.329 | 3.91 ± 0.08 | −0.135 |
Setting | SSIM | MCD | RMSE | MOS | CMOS |
---|---|---|---|---|---|
MixDiff-TTS | 0.507 | 6.652 | 0.337 | 3.95 ± 0.08 | 0.000 |
MixDiff-TTS w/o pre-net | 0.501 | 6.661 | 0.372 | 3.93 ± 0.07 | −0.129 |
Setting | SSIM | MCD | RMSE | MOS | CMOS |
---|---|---|---|---|---|
MixDiff-TTS | 0.507 | 6.652 | 0.337 | 3.95 ± 0.08 | 0.000 |
MixDiff-TTS w/o residual network | 0.501 | 6.725 | 0.345 | 3.92 ± 0.07 | −0.131 |
MixDiff-TTS w/o post-net | 0.505 | 6.793 | 0.331 | 3.86 ± 0.08 | −0.218 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).