1. Introduction
Speech synthesis is a technology that converts text into speech signals by analyzing textual features and mapping them onto acoustic representations. Using digital signal processing methods, it produces audible, human-like speech that users can comprehend by ear. Within artificial intelligence, speech synthesis integrates speech processing with natural language processing and automation, driving advances in human–machine interaction [1].
Advancements in technology and improvements in computer performance have made human–machine speech interaction a pivotal research focus in artificial intelligence. Speech synthesis, as a core technology, has garnered unprecedented attention in recent years [2]. Current research extensively explores deep learning techniques to generate realistic and fluent human-like voices, ranging from convolutional neural networks [3] and recurrent neural networks [4] to generative adversarial networks [5]. These advancements include the application of reinforcement learning, enabling deep learning models to adapt voice synthesis based on contextual cues. Contemporary speech synthesis has evolved from traditional waveform-concatenation, parametric, and rule-based methods to approaches based on deep learning.
Since the early 21st century, deep learning technologies, particularly neural networks applied to speech synthesis, have advanced rapidly. Google’s WaveNet model [6], introduced in 2016, marked a significant milestone in deep learning-based speech synthesis technology. WaveNet directly generates speech waveforms using deep convolutional neural networks, significantly enhancing the naturalness and quality of synthesized speech. Subsequent models like Tacotron [7] and Tacotron2 [8] further propelled text-to-speech conversion towards end-to-end models, achieving more efficient and natural speech synthesis. Tacotron has also been utilized in voice cloning, multilingual synthesis, and emotional speech synthesis [2,9]. Current deep learning approaches such as Transformer TTS [10], FastSpeech [11], and FastSpeech2 [12] not only enhance synthesis efficiency and speech quality but also reduce computational resource demands, enabling broader practical applications.
In the field of Tibetan speech synthesis, technological development has transitioned from traditional methods to modern end-to-end approaches. Initial Tibetan speech synthesis systems relied heavily on HMM statistical parameter models [13], demonstrating effectiveness with limited data but facing challenges in naturalness and audio quality. With the introduction of deep learning technologies, Tibetan speech synthesis has gradually shifted towards end-to-end neural network models.
Initially, a Tibetan speech synthesis model was developed based on the Tacotron framework [14], using the classic Griffin–Lim vocoder [15] to generate speech waveforms; the resulting waveforms were over-smoothed, leading to poor synthesized speech quality. To address this issue, Tacotron2 emerged and was applied to speech synthesis for the Tibetan Lhasa dialect [16]. Tacotron2 employs the WaveNet vocoder to enhance synthesis quality; nevertheless, its reliance on the LSTM [17] model for encoding contextual information makes it difficult to model long-term dependencies. Conversely, Transformer TTS [18] relies entirely on the self-attention mechanism and can effectively manage long-term dependencies by constructing context vectors from various perspectives using the multi-head attention mechanism. However, Transformer TTS is hindered by its inability to parallelize the inference process and by misalignment in the attention between encoder and decoder. In response, the FastSpeech2 method has been applied to Tibetan speech synthesis [19]. Since FastSpeech2 is non-autoregressive, it generates speech faster than traditional autoregressive models. Furthermore, its flexibility in modeling long-term dependencies enhances both the quality and smoothness of synthesized speech, making FastSpeech2 highly regarded in current research.
In this paper, we apply the FastSpeech2 model to Tibetan speech synthesis. However, FastSpeech2 uses the external alignment tool MFA (Montreal Forced Aligner) [20] to obtain the durations of Latin letters, which introduces errors and noise during alignment because there are no distinct boundaries between different letters. To address this problem, we propose the mixture alignment FastSpeech2 model, which adopts the idea of the mixture alignment mechanism used in the linguistic encoder of PortaSpeech [21]. This model employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, thereby enhancing semantic comprehension and mitigating the errors and noise caused by hard alignment of Latin letters.
In addition, the majority of end-to-end speech synthesis systems require a substantial volume of paired high-quality speech data for model training [22]. However, directly applying the FastSpeech2 model to Tibetan speech synthesis is hindered by the lack of large-scale, high-quality Tibetan speech–text pairs. Consequently, this study pre-trains the FastSpeech2 baseline model on an English dataset and then transfers the pre-trained model to Tibetan speech synthesis. The experimental results demonstrate that our proposed mixture alignment FastSpeech2 model excels in synthesized audio quality, Mel spectrogram prediction, and attention alignment accuracy. Furthermore, the use of pre-trained models also contributes to enhancing the quality of synthesized speech.
The main contributions of this work are summarized as follows:
We address the issue of misalignment caused by hard alignment at the Latin letter level by integrating the concept of mixture alignment into the FastSpeech2 model. Our proposed model, termed mixture alignment FastSpeech2, is applied to Tibetan speech synthesis, demonstrating significant improvements in alignment accuracy and synthesis quality.
Given the limited availability of Tibetan data, we advocate transferring a model pre-trained on larger datasets to the Tibetan language domain. This approach aims to enhance the naturalness of synthesized speech by leveraging the rich representations learned from abundant resources.
In Section 2, we describe the constructed end-to-end speech synthesis architecture, including the model used and its improvements. Section 3 describes the experimental details and results. Finally, Section 4 summarizes the full paper.
2. Model Architecture
The mixture alignment FastSpeech2 model comprises four main components: Tibetan text preprocessing, the mixture alignment encoder, the decoder, and the HiFi-GAN vocoder [23]. Figure 1 provides an overview of the model structure.
2.1. Tibetan Text Preprocessing
The majority of contemporary Tibetan speech synthesis models employing deep neural networks utilize Tibetan phonemes as their foundational synthesis units. Converting Tibetan text into phonemes necessitates a comprehensive and precise syllable lexicon. Characterized by its uniqueness and complexity, the Tibetan writing system possesses a rich character structure. This lexicon encompasses a myriad of potential syllable combinations, each corresponding to a distinct phonological unit. Constructing such a syllabic lexicon requires profound expertise in Tibetan linguistics and can be challenging due to the complexity and size of the lexicon. Additionally, even with a precise syllabic dictionary, there may still be potential noise.
To address these challenges and simplify the process, this paper opts to use Wylie transcription to convert Tibetan characters into Latin letters. Wylie transcription is a widely recognized system that provides a standardized approach to transliterating Tibetan script into Latin characters. This system maps Tibetan syllables to specific Latin letter combinations based on their phonetic properties, thus creating a one-to-one correspondence between Tibetan sounds and Latin representations. In the Wylie transcription system, the process of converting Tibetan characters into Latin letters includes the following rules:
Tibetan vowels and consonants are transcribed into specific Latin letters or combinations, ensuring consistency across different Tibetan words and phrases.
Tibetan syllables typically include a root letter along with possible prefixes, superscripts, subscripts, and postfixes. The Wylie transcription system provides specific rules for these components.
Special symbols and tonal marks in Tibetan are represented by specific Latin characters in the Wylie transcription system to reduce ambiguity and preserve phonetic detail. Additionally, some syllable combinations undergo phonological changes in specific contexts, and the Wylie system typically transcribes the original characters directly.
This method reduces the complexity and potential noise associated with managing a comprehensive syllabic lexicon, thereby providing a more efficient solution for synthesizing Tibetan speech. The use of Wylie transcription helps address lexical disparities by offering a consistent transliteration approach, which is crucial for accurate grapheme conversion.
In practical applications, the process of transcribing Tibetan characters into Latin letters involves several steps. Initially, Tibetan text is processed by removing unnecessary spaces and punctuation marks and segmenting them into individual syllables. Each Tibetan letter is subsequently matched to its corresponding Latin letter according to Wylie transcription rules, involving the identification and conversion of root letters, prefixes, superscripts, subscripts, and suffixes. Finally, these transcribed components are combined to generate the complete Wylie transcription string. Notably, the symbol “|” is omitted at the end of sentences, and the symbol “·” is transcribed as a space, simplifying the treatment of special symbols.
Figure 2 displays the Latin letter representations of certain Tibetan characters transcribed via Wylie, visually illustrating these steps and providing a clear and standardized Latin representation of Tibetan sentences while preserving the original letter order.
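To make the preprocessing steps concrete, the following is a minimal, illustrative sketch of the transcription pipeline described above. The tiny WYLIE_MAP lookup table covers only a handful of letters and omits the inherent-vowel and full prefix/superscript/subscript/suffix rules, so it is a simplified stand-in rather than the transcription scheme actually used in the paper.

```python
# Minimal sketch of the Tibetan-to-Wylie preprocessing described above.
# WYLIE_MAP covers only a few letters for illustration; a complete system needs the
# full Wylie inventory plus inherent-vowel and letter-stack handling.
WYLIE_MAP = {
    "ཀ": "k", "ཁ": "kh", "ག": "g", "ང": "ng",
    "བ": "b", "ཤ": "sh", "ས": "s", "ི": "i", "ྲ": "r",
}
TSHEG = "་"   # syllable delimiter, transcribed as a space
SHAD = "།"    # sentence-final mark, omitted in the transcription

def tibetan_to_wylie(text: str) -> str:
    text = text.replace(" ", "").rstrip(SHAD)   # remove spaces, drop trailing sentence mark
    syllables = text.split(TSHEG)               # segment into syllables at the tsheg
    out = []
    for syl in syllables:
        # map each Tibetan letter to its Latin counterpart (default: keep as-is)
        out.append("".join(WYLIE_MAP.get(ch, ch) for ch in syl))
    return " ".join(s for s in out if s)        # the tsheg becomes a space

print(tibetan_to_wylie("བཀྲ་ཤིས།"))             # -> "bkr shis" with this toy map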
2.2. Mixture Alignment Encoder
Compared to the encoder in FastSpeech2, the mixture alignment encoder incorporates an additional Tibetan character encoder. The architecture of the mixture alignment encoder is depicted in Figure 3a, where CP represents the pooling operation at the Tibetan character level and LR represents the length regulator introduced in FastSpeech [11]. A sequence of Latin letters annotated with Tibetan character boundaries is passed as input to the encoder. The Latin letter encoder transforms this sequence into Latin-letter hidden states; character pooling then averages the Latin-letter hidden states within each character, according to the character boundary information, to obtain the input to the character encoder. The character encoder converts these pooled states into character-level hidden states, and a length regulator expands the character-level hidden states according to character-level durations so that they match the length of the target Mel spectrogram. Lastly, the Latin-letter hidden states and the expanded character-level hidden states are processed with character-level relative position encoding and forwarded to the character-to-letter attention module.
In the mixture alignment model, the Latin-letter hidden representations are extended through the character-level duration predictor and the character-to-letter attention mechanism within the mixture alignment encoder. The duration of each Tibetan character is obtained by summing the durations of its Latin letters, ensuring precise alignment at the character level. The character-to-letter attention mechanism then computes an attention weight for each Latin letter’s hidden representation within a character, achieving soft alignment at the Latin letter level. The advantage of the mixture alignment mechanism is that hard alignment at the character level incurs fewer errors than alignment at the Latin letter level, while soft alignment at the Latin letter level captures more fine-grained information.
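As an illustration of the character pooling (CP) step, the sketch below averages the Latin-letter hidden states belonging to each Tibetan character, using a per-letter character index as the boundary information. The tensor shapes and the use of index_add_ are our own assumptions about one reasonable implementation, not the paper’s exact code.

```python
import torch

def character_pool(letter_hidden: torch.Tensor, char_ids: torch.Tensor, num_chars: int) -> torch.Tensor:
    """Average Latin-letter hidden states within each Tibetan character.

    letter_hidden: (num_letters, hidden_dim) hidden states from the letter encoder.
    char_ids:      (num_letters,) index of the character each letter belongs to.
    Returns:       (num_chars, hidden_dim) inputs to the character encoder.
    """
    hidden_dim = letter_hidden.size(1)
    sums = torch.zeros(num_chars, hidden_dim).index_add_(0, char_ids, letter_hidden)
    counts = torch.zeros(num_chars).index_add_(0, char_ids, torch.ones_like(char_ids, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

# Example: five letters grouped into two characters according to boundary information
letters = torch.randn(5, 8)
ids = torch.tensor([0, 0, 0, 0, 1])
print(character_pool(letters, ids, num_chars=2).shape)   # torch.Size([2, 8])
```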
2.2.1. Encoder Architecture
In this paper, both the letter encoder and the character encoder share an identical architecture. Each encoder block includes a multi-head self-attention mechanism, residual connections, normalization, and a one-dimensional convolutional neural network (1D CNN). The multi-head attention mechanism enables enhanced extraction of cross-positional information. Residual connections are employed to address the issues of gradient vanishing and explosion. Normalization adjusts the mean and variance of the input values of each layer to achieve uniformity, thereby enhancing the convergence efficiency. The specific model structure is shown in Figure 3b.
To establish closer connections between adjacent Latin letters, a one-dimensional convolutional neural network (1D CNN) with ReLU activation is employed instead of fully connected networks. A 1D CNN is particularly effective for processing sequential data, such as text sequences.
For example, in Tibetan speech synthesis, consider the Latin representation “bkra shis”, which is the result of transliterating Tibetan text into Latin letters. A 1D CNN applies convolutional filters to this sequence of Latin letters. Using a filter of length three, the network slides over the sequence and processes overlapping windows such as “bkr”, “kra”, “ra ”, and so on. This convolutional operation enables the model to capture local dependencies and patterns within the sequence, such as specific combinations of letters or syllables. This ability to learn from adjacent letters and syllables is essential for generating accurate and natural Tibetan speech.
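The sketch below shows this kind of 1D convolution applied to an embedded letter sequence; the embedding size, channel count, and kernel size of three are illustrative choices rather than the paper’s exact hyperparameters.

```python
import torch
import torch.nn as nn

letters = "bkra shis"                                  # Wylie transliteration of a Tibetan phrase
vocab = {ch: i for i, ch in enumerate(sorted(set(letters)))}
ids = torch.tensor([[vocab[ch] for ch in letters]])    # (batch=1, seq_len=9)

embed = nn.Embedding(len(vocab), 16)                   # letter embeddings (size is illustrative)
conv = nn.Conv1d(16, 16, kernel_size=3, padding=1)     # window of three adjacent letters

x = embed(ids).transpose(1, 2)                         # (1, 16, 9): Conv1d expects (N, C, L)
y = torch.relu(conv(x))                                # each output frame mixes "bkr", "kra", ...
print(y.shape)                                         # torch.Size([1, 16, 9])
```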
2.2.2. Predictor Architecture
The duration predictor is composed of a two-layer 1D convolutional architecture, with a ReLU activation layer, a normalization layer, and a dropout layer following each 1D convolutional layer. An additional linear layer then transforms the hidden features into one scalar per character, indicating the number of frames over which each character is pronounced. The input to the duration predictor is the hidden features from the character encoder, and the output is the actual pronunciation duration of each character. The structure of the duration predictor is shown in Figure 3c.
In addition to the duration predictor, this paper introduces pitch and energy predictors structured similarly to the duration predictor described above. The primary goal of the pitch predictor is to accurately forecast contour changes in the fundamental frequency (F0) of speech signals, which effectively convey different emotional states. F0 is extracted from the raw audio waveform using a hop size identical to that of the target Mel spectrogram, ensuring that the F0 time frames align with the Mel spectrogram frames. F0 is first computed at each time frame and then quantized into 256 possible values representing different pitch levels; the quantized F0 is subsequently transformed into a sequence of one-hot vectors for further processing.
The energy predictor determines the volume of the sound signal by computing the L2 norm of the short-time Fourier transform (STFT) of each frame. This converts each frame of the sound signal into a numerical value reflecting its intensity or loudness, which is then quantized into 256 values.
During the training phase, pitch and energy information calculated directly from the original speech are utilized. The predictors’ outputs are optimized by comparing their mean squared errors with real F0 and energy values. In the inference phase, trained pitch and energy predictors are used to predict F0 and energy for each time frame.
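A simplified sketch of the frame-level pitch and energy extraction described above is given below. The pYIN pitch range, the linear quantization grid, and the STFT settings are our own assumptions; the paper’s exact extraction parameters may differ.

```python
import numpy as np
import librosa

def pitch_energy_features(wav: np.ndarray, sr: int = 16000, hop: int = 256, n_bins: int = 256):
    """Frame-level F0 and energy, each quantized into n_bins values (simplified sketch)."""
    # F0 per frame via pYIN, using the same hop size as the target Mel spectrogram
    f0, _, _ = librosa.pyin(wav, fmin=65.0, fmax=600.0, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)                                             # unvoiced frames -> 0
    f0_bins = np.digitize(f0, np.linspace(0.0, 600.0, n_bins - 1))     # 256 pitch levels

    # Energy per frame: L2 norm of the STFT magnitude
    stft = np.abs(librosa.stft(wav, n_fft=1024, hop_length=hop))
    energy = np.linalg.norm(stft, axis=0)
    energy_bins = np.digitize(energy, np.linspace(0.0, energy.max() + 1e-6, n_bins - 1))
    return f0_bins, energy_bins
```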
The introduction of these predictors enables our end-to-end system to capture pitch and energy variations within speech signals more accurately, thereby enhancing the naturalness and expressiveness of speech synthesis. Future research will continue to refine these predictors, exploring more effective feature extraction and model training strategies to further improve the quality and performance of speech synthesis.
2.2.3. Length Regulator
Suppose there is a sequence of character-level hidden representations $\mathcal{H} = [h_1, h_2, \ldots, h_n]$ and a sequence of character duration frames $\mathcal{D} = [d_1, d_2, \ldots, d_n]$, which satisfies $\sum_{i=1}^{n} d_i = m$, where $m$ represents the spectrum length, $LR$ represents the length regulator, and $\mathcal{D}$ represents the character duration frames. The hidden feature for spectrum generation, $\mathcal{H}_{mel}$, is derived from the following equation:
$$\mathcal{H}_{mel} = LR(\mathcal{H}, \mathcal{D}, \alpha)$$
where the parameter $\alpha$ determines the length of the extended sequence, thereby regulating the rate of speech. In the inference stage, this paper uses the durations generated by the duration predictor as the input.
To illustrate this process more clearly, consider the example shown in Figure 4. The figure demonstrates the transformation from $\mathcal{H}$ to $\mathcal{H}_{mel}$ using the length regulator and the character duration frames $\mathcal{D}$. In this example, given $\mathcal{H}$ and $\mathcal{D}$, the length regulator adjusts the sequence by repeating each character hidden representation according to the corresponding duration, resulting in the final sequence $\mathcal{H}_{mel}$.
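A minimal implementation of the length regulator is sketched below: it simply repeats each character-level hidden vector according to its (possibly speed-scaled) duration, which is the behaviour described above. The function name and rounding choice are our own.

```python
import torch

def length_regulator(char_hidden: torch.Tensor, durations: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Expand character-level hidden states to the Mel-frame level.

    char_hidden: (num_chars, hidden_dim)
    durations:   (num_chars,) frames per character (ground truth in training,
                 duration-predictor output at inference)
    alpha:       speed control; alpha > 1 lengthens durations (slower speech)
    """
    scaled = torch.round(durations.float() * alpha).long()
    return torch.repeat_interleave(char_hidden, scaled, dim=0)   # (sum(scaled), hidden_dim)

h = torch.randn(3, 4)                    # three characters
d = torch.tensor([2, 1, 3])              # their durations in frames
print(length_regulator(h, d).shape)      # torch.Size([6, 4])
```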
2.2.4. Character-to-Letter Attention Module
This study introduces a character-to-letter attention module that extends the hidden features at the Latin letter level. Specifically, the expanded character-level hidden states $\mathcal{H}_{mel}$ serve as the query $Q$, with the Latin-letter hidden states functioning as both the key $K$ and the value $V$. The formula for calculating the attention weights is as follows:
$$\alpha = \mathrm{softmax}\!\left(\frac{Q\,(K W^{K})^{\top}}{\sqrt{d}}\right)$$
where $W^{K}$ represents the learnable transformation matrix, and $\alpha_i$ denotes the attention weight on the hidden feature of the $i$-th Latin letter. Then, using the computed attention weights $\alpha$, the output for each query vector is calculated as follows:
$$O = \alpha\,(V W^{V})$$
where $W^{V}$ represents another learnable transformation matrix. In this way, this study obtains a Latin-letter representation $O$ matching the spectrum length $m$. To ensure that each query vector only accesses key–value pairs within the same character, a mask matrix based on the character boundary information from the input text encoder is constructed and incorporated into the attention weight matrix $\alpha$. Additionally, to account for the monotonic alignment of Latin letters with speech, positional encoding information is added to $Q$, $K$, and $V$. For $K$ and $V$, this positional encoding information is $\frac{i}{N_c}\,w$, where $i$ represents the position of each Latin letter within its corresponding character, $N_c$ represents the total number of Latin letters in character $c$, and $w$ is a learnable vector embedding. For $Q$, the positional encoding information is $\frac{j}{T_c}\,w'$, where $j$ represents the frame’s position within the character, $T_c$ represents the total number of frames in the character, and $w'$ is another learnable vector embedding.
To illustrate this process, Figure 5 shows the calculation principle of character-to-letter attention. The figure helps to visualize how the attention weights $\alpha$ are computed and applied to obtain the final output $O$.
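The sketch below illustrates the masked scaled dot-product attention used here: the expanded character-level states attend only to the Latin-letter states inside the same character, enforced by a boundary mask. For brevity it omits the learnable projections $W^{K}$, $W^{V}$ and the relative positional terms, so it is a simplified assumption rather than the full module.

```python
import torch
import torch.nn.functional as F

def char_to_letter_attention(q, k, v, frame_char_ids, letter_char_ids):
    """q: (T, d) expanded character states (one per Mel frame),
    k, v: (L, d) Latin-letter states,
    frame_char_ids: (T,) character index of each frame,
    letter_char_ids: (L,) character index of each letter."""
    d = q.size(-1)
    scores = q @ k.transpose(0, 1) / d ** 0.5              # (T, L)
    # Mask: a frame may only attend to letters of its own character
    mask = frame_char_ids.unsqueeze(1) != letter_char_ids.unsqueeze(0)
    scores = scores.masked_fill(mask, float("-inf"))
    alpha = F.softmax(scores, dim=-1)                      # attention weights
    return alpha @ v                                       # (T, d) letter-level output

q = torch.randn(6, 8)                                      # 6 Mel frames
k = v = torch.randn(5, 8)                                  # 5 Latin letters
out = char_to_letter_attention(q, k, v,
                               frame_char_ids=torch.tensor([0, 0, 0, 1, 1, 1]),
                               letter_char_ids=torch.tensor([0, 0, 0, 1, 1]))
print(out.shape)                                           # torch.Size([6, 8])
```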
2.3. Decoder Architecture
The decoder architecture is the same as the character encoder architecture. First, the decoder receives vectors encoded with positional information by the mixture alignment encoder. These vectors contain positional and contextual information of the input data. Next, the decoder passes these vectors to the Mel linear module. The role of this module is to linearly project the hidden states into the Mel spectrogram, thereby generating the corresponding speech features. The Mel spectrogram is an important feature representation in speech synthesis. It mimics the human ear’s perception by applying a nonlinear transformation to the speech spectrum. Finally, the Mel spectrogram output by the decoder is used to generate high-quality speech signals.
2.4. HiFi-GAN Vocoder
To optimize both synthesis speed and quality, this study utilizes the HiFi-GAN [23] vocoder to convert the speech spectrum into waveform audio more efficiently and with enhanced fidelity. The HiFi-GAN vocoder comprises a generator and two discriminators. The generator is a convolutional neural network that takes the Mel spectrogram as input and repeatedly up-samples it until the output sequence matches the time-domain resolution of the original waveform, producing outputs analogous to real speech. The two discriminators assess the authenticity of the synthesized speech during adversarial training. Upon completion of training, the generator is employed directly during synthesis to produce the speech waveform signal.
2.5. Pre-Training Method
Currently, the amount of available Tibetan language corpus is relatively limited, and relying solely on parallel Tibetan corpora for system training yields suboptimal results. Inspired by transfer learning, this paper proposes a pre-training method to address this issue.
In the context of deep neural network-based transfer learning methods, fine-tuning is one of the simplest and most convenient approaches. Fine-tuning involves pre-training a model on one task and then adapting it to a related task by training it on new data specific to the target domain. This method significantly reduces training time and leverages features already learned by the pre-trained model, thereby enhancing performance on the new task. While fine-tuning does increase some training costs, its significant advantage lies in accelerating the training process, reducing the amount of data required and improving the final model’s performance. These benefits typically outweigh the additional costs. For resource-scarce languages like Tibetan, this pre-training method is particularly valuable. Compared to English and Chinese, Tibetan has very limited linguistic resources. By pre-training a model on a more resource-rich language and then fine-tuning it with Tibetan data, we can effectively utilize the extensive resources available for other languages, thereby improving the performance of Tibetan speech synthesis systems. This approach not only compensates for the lack of Tibetan data but also ensures the system performs well even with limited data.
Specifically, we first train an English speech synthesis system using “text-to-audio” data from a large English dataset. For this purpose, we used the publicly available LJSpeech dataset [24], which contains high-quality text–audio pairs that have been carefully selected and processed to ensure suitability for pre-training. In the initial training phase, we train the FastSpeech2 and mixture alignment FastSpeech2 models using the LJSpeech dataset. This process includes initializing the model with predefined parameters and conducting several training iterations to gradually learn the mapping between English text and the corresponding speech. During training, we use the mean absolute error (MAE) and mean squared error (MSE) loss functions to optimize the model parameters. To prevent overfitting, we periodically evaluate the model’s performance on a validation set.
After completing the initial training with the English dataset, we proceed with transfer learning. In this phase, we keep certain encoder and decoder weights fixed, particularly those layers that have effectively captured speech features. We then retrain the model using “text-to-audio” data in the target language to adapt the model to the new language characteristics. By adjusting the weights of the unfrozen layers, the model gradually learns the features of the target language.
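In PyTorch terms, the freezing step can be sketched as follows; the module layout, the choice of which layers to freeze, and the checkpoint path are placeholders standing in for whatever the actual implementation uses.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the acoustic model; the real module names will differ.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 256)),
    "decoder": nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 80)),
})

# 1) Load the English-pretrained weights (path is illustrative).
# model.load_state_dict(torch.load("ljspeech_pretrained.pt"), strict=False)

# 2) Freeze the layers that already capture generic speech/text features;
#    exactly which layers stay fixed is a design choice.
for name, param in model.named_parameters():
    if name.startswith(("encoder.0", "decoder.0")):
        param.requires_grad = False

# 3) Fine-tune only the remaining parameters on Tibetan text-audio pairs.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```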
This transfer learning method not only accelerates the decoder’s learning of speech feature information but also swiftly establishes correspondence between phonetic and textual units in the target language, akin to those in the source language. Furthermore, it effectively leverages encoder information to expedite the establishment of the language model and the extraction of text information.
3. Experiments and Results
3.1. Datasets
In this study, we used two datasets to train and pre-train the model: the Tibetan Lhasa dialect dataset and the LJSpeech dataset [24].
We utilized the Tibetan multi-dialect speech dataset from the Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE at Minzu University of China to train the model. In this experiment, we employed the 16 kHz, 16-bit mono Tibetan Lhasa dialect dataset. This dataset, which includes recordings from four different speakers, has a total duration of approximately 7.47 h and contains 6452 speech samples, with an average duration of approximately 4 s per sentence.
To further improve the model’s performance, we also used the LJSpeech dataset for pre-training. LJSpeech is a widely used, open, single-speaker speech dataset, featuring recordings of a female speaker. This dataset has a total duration of approximately 24 h, containing 13,100 speech samples, with an average duration of 6.57 s per sentence. The audio specifications are 22.05 kHz, 16-bit mono.
For consistency, we downsampled the LJSpeech audio to 16 kHz to match the sampling rate of the Tibetan Lhasa dialect dataset. This ensures uniformity in audio processing across both pre-training and training stages.
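The downsampling step can be performed with standard tools; a minimal torchaudio-based sketch is shown below (file names are illustrative).

```python
import torchaudio

# Downsample an LJSpeech clip from 22.05 kHz to 16 kHz to match the Tibetan data.
waveform, sr = torchaudio.load("LJ001-0001.wav")            # sr == 22050 for LJSpeech
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
waveform_16k = resampler(waveform)
torchaudio.save("LJ001-0001_16k.wav", waveform_16k, sample_rate=16000)
```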
In the training process, we used two GPUs for parallel processing, with the batch size set to 32. The output of the acoustic model is an eighty-dimensional Mel spectrogram, which is converted into audio samples using the pre-trained HiFi-GAN vocoder. The generated audio signals have a sampling frequency of 16 kHz.
3.2. Parameter Setup
In deep learning neural networks, both hyperparameters and network architecture are crucial factors. Hyperparameters are parameters that must be set manually, and they significantly influence the model’s performance and convergence speed. Improper choices can lead to issues such as failure to converge or overfitting. Therefore, selecting appropriate hyperparameters is paramount. The hyperparameters and network architecture settings used in this paper are shown in Table 1.
3.3. Feature Prediction
The Mel spectrogram visualizes the effects of synthesized speech, and the more detailed the Mel spectrogram, the higher the potential quality of the synthesized audio. The Mel spectrograms synthesized by each model are depicted in Figure 6. We observed that the mixture alignment FastSpeech2 model has an advantage in predicting Mel spectrogram details.
In the attention alignment graph, dots represent attention weight values; the higher the value, the brighter the dot. The alignment graph reflects the stability of the speech synthesis model: the clearer the line in the graph, the higher the accuracy of speech synthesis and the more stable the model. The alignment graph generated by the mixture alignment FastSpeech2 model is depicted in Figure 7; its clear and bright lines indicate good alignment capability.
3.4. Model Loss
In this section, we analyze the training losses for the mixture alignment FastSpeech2 model, focusing on duration, energy, and pitch losses. The overall model loss, which includes these components, is also discussed.
The losses for duration, energy, and pitch are calculated using the mean squared error (MSE) function:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ represents the true value, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples.
3.4.1. Duration Loss
Duration loss measures the error in predicting the duration of characters:
$$\mathcal{L}_{duration} = \mathrm{MSE}(\hat{D}, D)$$
where $\hat{D}$ is the predicted duration, and $D$ is the ground-truth duration. Duration loss converges quickly and typically remains lower than the energy and pitch losses.
3.4.2. Energy Loss
Energy loss quantifies the discrepancy in predicted energy levels:
$$\mathcal{L}_{energy} = \mathrm{MSE}(\hat{E}, E)$$
where $\hat{E}$ is the predicted energy, and $E$ is the ground-truth energy. Energy loss generally follows a trend similar to pitch loss.
3.4.3. Pitch Loss
Pitch loss assesses the error in predicting pitch values:
$$\mathcal{L}_{pitch} = \mathrm{MSE}(\hat{P}, P)$$
where $\hat{P}$ is the predicted pitch, and $P$ is the ground-truth pitch. Compared to the duration and energy losses, pitch loss exhibits significant oscillations.
3.4.4. Overall Model Loss
The overall model loss decreases as training progresses, indicating improved model performance. As training advances, the individual component losses stabilize, and the reduction in overall loss reflects the model’s enhanced ability to synthesize speech accurately.
The trends in the training losses are shown in Figure 8, which illustrates the changes in duration loss, energy loss, pitch loss, and overall model loss during training.
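For reference, under the standard FastSpeech2-style formulation the overall objective is the sum of the Mel-spectrogram reconstruction loss and the three variance losses. The unweighted sum below is an assumption, as the paper does not state explicit loss weights:
$$\mathcal{L}_{total} = \mathcal{L}_{mel} + \mathcal{L}_{duration} + \mathcal{L}_{pitch} + \mathcal{L}_{energy}$$
where $\mathcal{L}_{mel}$ is the mean absolute error between the predicted and ground-truth Mel spectrograms, consistent with the MAE and MSE losses mentioned in Section 2.5.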
3.5. Evaluation Indicators and Experimental Results
We employed both subjective and objective methods to evaluate the synthesized speech.
3.5.1. Subjective Evaluation Indicators
The mean opinion score (MOS) is a commonly used subjective assessment method. The MOS assessment criteria are shown in Table 2.
For the subjective evaluation, five Tibetan university students were invited, and ten Tibetan texts were randomly selected. The listeners were asked to score the speech synthesized from the same texts by the different models, using the original recordings as a reference. Finally, the average score across all listeners was taken as the final result.
3.5.2. Subjective Evaluation Results
Subjectively, ten Tibetan texts were randomly selected to compare the speech synthesis performance of FastSpeech2 and mixture alignment FastSpeech2; the experiments also included models pre-trained on the LJSpeech dataset. Five Tibetan participants then rated the speech synthesized by the different models. The MOS was calculated, and the results are displayed in Table 3. In the first row of the table, “Ground Truth” refers to the original recordings of the selected Tibetan texts, used as a reference for evaluating the quality of the synthesized speech.
To assess the effectiveness of the speech synthesis models, we first conducted an analysis of variance (ANOVA) to determine if there were any statistically significant differences among the mean opinion scores (MOS) of the various models. The ANOVA results revealed a significant overall difference (F-value = 11.90, p < 0.001), indicating that at least one of the models differed significantly from the others.
To further investigate which specific pairs of models exhibited significant differences, we performed a post-hoc Tukey HSD test. The results of this test are summarized in Table 4, which shows the adjusted p-values for comparisons between each model and the mixture alignment FastSpeech2 + pre-trained model. The test revealed that the mixture alignment FastSpeech2 + pre-trained model showed a significant improvement over both FastSpeech2 (p = 0.0065) and the FastSpeech2 + pre-trained model (p = 0.0065), confirming its superior performance.
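This statistical comparison can be reproduced with standard tools, as sketched below; the MOS arrays are random placeholders rather than the actual ratings.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder MOS ratings per model (5 raters x 10 texts, flattened); real scores differ.
rng = np.random.default_rng(0)
mos = {
    "FastSpeech2":                        rng.normal(3.7, 0.3, 50),
    "FastSpeech2+pretrained":             rng.normal(3.8, 0.3, 50),
    "MixtureAlignFastSpeech2":            rng.normal(3.9, 0.3, 50),
    "MixtureAlignFastSpeech2+pretrained": rng.normal(4.0, 0.3, 50),
}

# One-way ANOVA across models
f_val, p_val = f_oneway(*mos.values())
print(f"ANOVA: F = {f_val:.2f}, p = {p_val:.4f}")

# Post-hoc Tukey HSD for pairwise comparisons
scores = np.concatenate(list(mos.values()))
labels = np.repeat(list(mos.keys()), [len(v) for v in mos.values()])
print(pairwise_tukeyhsd(scores, labels))
```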
3.5.3. Objective Assessment Method
In the objective evaluation, this paper uses two key metrics: the root mean square error (RMSE) and the real-time factor (RTF) of the speech synthesis process.
Root Mean Square Error (RMSE)
RMSE measures the differences between the original and synthesized speech in the time domain. It quantifies the average magnitude of the errors between predicted and observed values. The calculation method is depicted in Equation (9):
$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}$$
where $y_t$ and $\hat{y}_t$ represent the original and synthesized speech sequences at moment $t$. The smaller the RMSE value, the greater the similarity between the original and synthesized speech, and the higher the quality of the synthesized speech.
Real-Time Factor (RTF)
The real-time factor (RTF) measures the efficiency of the speech synthesis process. It represents the ratio of the time required to process a given duration of audio to the actual duration of that audio. The RTF is calculated using the following equation:
$$\mathrm{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}}$$
where “Processing Time” is the total time taken by the system, including all components, to process the audio, and “Audio Duration” is the length of the audio being processed. A smaller RTF value indicates more efficient synthesis, meaning the system processes speech faster and is more suitable for real-time applications.
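Both metrics are straightforward to compute; the small sketch below uses placeholder arrays and timings rather than the paper’s data.

```python
import time
import numpy as np

def rmse(reference: np.ndarray, synthesized: np.ndarray) -> float:
    """Time-domain RMSE between original and synthesized waveforms."""
    n = min(len(reference), len(synthesized))
    return float(np.sqrt(np.mean((reference[:n] - synthesized[:n]) ** 2)))

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds

# Usage sketch with placeholder data (4 s of audio at 16 kHz)
ref = np.random.randn(16000 * 4)
start = time.time()
syn = ref + 0.01 * np.random.randn(len(ref))      # stand-in for a model's output
elapsed = time.time() - start
print(rmse(ref, syn), real_time_factor(elapsed, len(ref) / 16000))
```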
3.5.4. Objective Evaluation Results
The evaluation results for the objective metrics are presented in the following tables.
RMSE Results
Table 5 shows the RMSE results for the different models. A smaller RMSE value indicates better performance in synthesizing speech that closely matches the original speech.
RTF Results
Table 6 presents the RTF results for the various models. A smaller RTF value indicates more efficient synthesis, with the model requiring less time to synthesize audio.
As indicated in Table 5 and Table 6, the RMSE values for the mixture alignment FastSpeech2 model decreased, reflecting improved accuracy in speech synthesis. The utilization of the model pre-trained on the English dataset resulted in a smaller RMSE compared to models without pre-training, enhancing the quality of synthesized speech. Additionally, the RTF values demonstrate that the mixture alignment FastSpeech2 + pre-trained model processes speech more efficiently, achieving the lowest RTF value of 0.186. This indicates that this model is not only more accurate but also more efficient in real-time speech synthesis applications.
Training Time and Inference Speed
Table 7 compares the training times and inference speeds of different models. This comparison highlights the practical aspects of model deployment and efficiency.
When comparing the Transformer TTS and mixture alignment FastSpeech2 models, the latter demonstrates significant advantages in both training time and inference speed, as detailed in Table 7. Mixture alignment FastSpeech2 requires only 45 h for training, markedly less than the 122 h needed by Transformer TTS. This reduction is largely due to the autoregressive nature of Transformer TTS, which necessitates a longer training period to capture detailed speech data, whereas mixture alignment FastSpeech2 employs a more efficient alignment mechanism. In terms of inference speed, mixture alignment FastSpeech2 outperforms Transformer TTS with a real-time factor of 0.193 compared to 0.984, highlighting its superior efficiency. Although mixture alignment FastSpeech2 has a slightly longer training time than FastSpeech2 (37 h), owing to the additional computations required by its mixture alignment mechanism, the resulting gains in alignment accuracy and inference efficiency make mixture alignment FastSpeech2 a more practical and efficient choice for real-world applications.
Both the subjective and objective evaluation results indicate that using a pre-trained model leads to higher quality and naturalness in the synthesized speech. We speculate that the reasons for this improvement are as follows: first, the pre-trained model learns rich feature representations from the LJSpeech dataset and transfers these learned representations to Tibetan speech synthesis; second, the pre-trained model absorbs relevant audio and acoustic characteristics of speech, improving its ability to capture the underlying structure of the spoken content; finally, the pre-trained model generalizes well, enabling it to handle unseen data and complex scenarios.
3.6. Ablation Studies
To understand the contributions of different components in our mixture alignment FastSpeech2 model, we conducted ablation studies. Specifically, we evaluated the impact of the pitch predictor, the energy predictor, and the mixture alignment mechanism on overall performance, using comparative mean opinion score (CMOS) evaluations. The results are shown in Table 8. We found that removing the energy predictor from mixture alignment FastSpeech2 results in a drop in voice quality, indicating that the energy predictor is effective in improving voice quality. Removing the pitch predictor results in a CMOS decrease of 0.11, demonstrating the effectiveness of the pitch predictor. When both the pitch and energy predictors are removed, voice quality drops even further, indicating that both predictors are crucial to the performance of mixture alignment FastSpeech2.
To validate the effectiveness of the mixture alignment mechanism, we compared mixture alignment FastSpeech2 (with mixture alignment) against FastSpeech2 (with hard alignment). The experimental results are shown in row 5 of Table 8 and indicate that mixture alignment FastSpeech2 performed better, confirming the effectiveness of the mixture alignment mechanism.
4. Conclusions
This study presented an end-to-end model incorporating a mixture alignment mechanism based on FastSpeech2 for Tibetan speech synthesis. Firstly, the Tibetan text underwent preprocessing, with the characters transcribed into Latin letters following the Wylie transcription rules. Then, the mixture alignment FastSpeech2 model was proposed to mitigate errors arising from the rigid alignment of Latin letters. Additionally, to bolster the model’s generalization capabilities, the synthesis model underwent training using an English dataset before transferring the pre-trained model to Tibetan speech synthesis.
Experimental results demonstrate significant improvements in synthesized speech quality achieved by the mixture alignment FastSpeech2 model. Although oscillations were observed while learning the pitch and energy features, these observations point to directions for further optimization and improvement. Future efforts will focus on integrating features such as pitch, energy, and speech styles to enhance the naturalness and expressiveness of synthesized speech.
Future research will explore personalized speech synthesis by integrating speaker information into models to customize generated speech based on individual characteristics and styles. Personalized speech synthesis not only tailors speech generation to speaker attributes but also enhances the user’s experience by meeting specific needs.
With advancements in artificial intelligence and deep learning technologies, large models demonstrate immense potential in speech synthesis. Models like GPT-4, T5, and BERT possess strong learning and generalization capabilities, better capturing and expressing language complexity. Future applications aim to further improve Tibetan speech synthesis quality and effectiveness by leveraging these large models. By incorporating multilingual training and cross-lingual learning, these models will utilize diverse data resources from other languages to further enhance Tibetan speech synthesis.
Addressing technical challenges will involve exploring data augmentation techniques, model optimization strategies, and enhancing computational efficiency. These efforts will elevate Tibetan speech synthesis technical standards and broaden its applications in education, cultural dissemination, and intelligent assistants. In conclusion, large models show promising application prospects in Tibetan speech synthesis. Continuous innovation and optimization aim to contribute possibilities and opportunities for Tibetan language development and dissemination.