1. Introduction
Speech synthesis is a technology that converts text into speech signals by analyzing textual features and mapping them onto acoustic representations. Using digital signal processing methods, it produces audible, human-like speech that users can comprehend by ear. Within artificial intelligence, speech synthesis integrates speech processing with natural language processing and automation, driving advances in human–machine interaction [1].
Advancements in technology and improvements in computer performance have made human–machine speech interaction a pivotal research focus in artificial intelligence. Speech synthesis, as a core technology, has garnered unprecedented attention in recent years [2]. Current research extensively explores deep learning techniques to generate realistic and fluent human-like voices, ranging from convolutional neural networks [3] and recurrent neural networks [4] to generative adversarial networks [5]. These advancements include the application of reinforcement learning, enabling deep learning models to adapt voice synthesis based on contextual cues. Contemporary speech synthesis has evolved from traditional waveform-concatenation, parametric, and rule-based methods to approaches based on deep learning.
Since the early 21st century, deep learning technologies, particularly neural networks applied to speech synthesis, have advanced rapidly. Google’s WaveNet model [6], introduced in 2016, marked a significant milestone in deep learning-based speech synthesis technology. WaveNet directly generates speech waveforms using deep convolutional neural networks, significantly enhancing the naturalness and quality of synthesized speech. Subsequent models like Tacotron [7] and Tacotron2 [8] further propelled text-to-speech conversion towards end-to-end models, achieving more efficient and natural speech synthesis. Tacotron has also been utilized in voice cloning, multilingual synthesis, and emotional speech synthesis [2,9]. Current deep learning approaches such as Transformer TTS [10], FastSpeech [11], and FastSpeech2 [12] not only enhance synthesis efficiency and speech quality but also reduce computational resource demands, enabling broader practical applications.
In the field of Tibetan speech synthesis, technological development has transitioned from traditional methods to modern end-to-end approaches. Initial Tibetan speech synthesis systems relied heavily on HMM statistical parameter models [13], demonstrating effectiveness with limited data but facing challenges in naturalness and audio quality. With the introduction of deep learning technologies, Tibetan speech synthesis has gradually shifted towards end-to-end neural network models.
Initially, a Tibetan speech synthesis model was developed based on the Tacotron framework [14], using the classic Griffin–Lim vocoder [15] to generate speech waveforms; the resulting waveforms were over-smoothed, leading to poor synthesized speech quality. To address this issue, Tacotron2 emerged and was applied to speech synthesis for the Tibetan Lhasa dialect [16]. Tacotron2 employs the WaveNet vocoder to enhance synthesis quality; nevertheless, its reliance on the LSTM [17] model for encoding contextual information makes it difficult to model long-term dependencies. Conversely, Transformer TTS [18] relies entirely on the self-attention mechanism and can effectively manage long-term dependencies by constructing context vectors from various perspectives using the multi-head attention mechanism. However, Transformer TTS is hindered by its inability to parallelize the inference process and by misalignment in the attention between encoder and decoder. In response, the FastSpeech2 method has been applied to Tibetan speech synthesis [19]. Since FastSpeech2 is non-autoregressive, it generates speech faster than traditional autoregressive models. Furthermore, its flexibility in modeling long-term dependencies enhances both the quality and smoothness of synthesized speech, making FastSpeech2 highly regarded in current research.
In this paper, we apply the FastSpeech2 model to Tibetan speech synthesis. However, FastSpeech2 uses the external alignment tool MFA (Montreal Forced Aligner) [20] to obtain the durations of Latin letters, which introduces errors and noise during alignment because there are no distinct boundaries between different letters. To address this problem, we propose the mixture alignment FastSpeech2 model, which adopts the idea of the mixture alignment mechanism used in the linguistic encoder of PortaSpeech [21]. This model employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, thereby enhancing semantic comprehension and mitigating the errors and noise caused by hard alignment of Latin letters.
In addition, the majority of end-to-end speech synthesis systems require a substantial volume of paired high-quality speech data for model training [22]. However, directly applying the FastSpeech2 model to Tibetan speech synthesis is hindered by the lack of large-scale, high-quality Tibetan speech–text pairs. Consequently, this study pre-trains the FastSpeech2 baseline model on an English dataset and then transfers the pre-trained model to Tibetan speech synthesis. The experimental results demonstrate that our proposed mixture alignment FastSpeech2 model excels in synthesized audio quality, Mel spectrogram prediction, and attention alignment accuracy. Furthermore, the use of pre-trained models also contributes to enhancing the quality of synthesized speech.
The main contributions of this work are summarized as follows:
We address the issue of misalignment caused by hard alignment at the Latin letter level by integrating the concept of mixture alignment into the FastSpeech2 model. Our proposed model, termed mixture alignment FastSpeech2, is applied to Tibetan speech synthesis, demonstrating significant improvements in alignment accuracy and synthesis quality.
Given the limited availability of Tibetan data, we advocate transferring a model pre-trained on larger datasets to the Tibetan language domain. This approach aims to enhance the naturalness of synthesized speech by leveraging the rich representations learned from abundant resources.
In Section 2, we describe the constructed end-to-end speech synthesis architecture, including the model used and its improvements. Section 3 describes the experimental details and results. Finally, Section 4 summarizes the full paper.
2. Model Architecture
The mixture alignment FastSpeech2 model comprises four main components: Tibetan text preprocessing, the mixture alignment encoder, the decoder, and the HiFi-GAN vocoder [23]. Figure 1 provides an overview of the model structure.
2.1. Tibetan Text Preprocessing
The majority of contemporary Tibetan speech synthesis models employing deep neural networks utilize Tibetan phonemes as their foundational synthesis units. Converting Tibetan text into phonemes necessitates a comprehensive and precise syllable lexicon. Characterized by its uniqueness and complexity, the Tibetan writing system possesses a rich character structure. This lexicon encompasses a myriad of potential syllable combinations, each corresponding to a distinct phonological unit. Constructing such a syllabic lexicon requires profound expertise in Tibetan linguistics and can be challenging due to the complexity and size of the lexicon. Additionally, even with a precise syllabic dictionary, there may still be potential noise.
To address these challenges and simplify the process, this paper opts to use Wylie transcription to convert Tibetan characters into Latin letters. Wylie transcription is a widely recognized system that provides a standardized approach to transliterating Tibetan script into Latin characters. This system maps Tibetan syllables to specific Latin letter combinations based on their phonetic properties, thus creating a one-to-one correspondence between Tibetan sounds and Latin representations. In the Wylie transcription system, the process of converting Tibetan characters into Latin letters includes the following rules:
Tibetan vowels and consonants are transcribed into specific Latin letters or combinations, ensuring consistency across different Tibetan words and phrases.
Tibetan syllables typically include a root letter along with possible prefixes, superscripts, subscripts, and postfixes. The Wylie transcription system provides specific rules for these components.
Special symbols and tonal marks in Tibetan are represented by specific Latin characters in the Wylie transcription system to reduce ambiguity and preserve phonetic detail. Additionally, some syllable combinations undergo phonological changes in specific contexts, and the Wylie system typically transcribes the original characters directly.
This method reduces the complexity and potential noise associated with managing a comprehensive syllabic lexicon, thereby providing a more efficient solution for synthesizing Tibetan speech. The use of Wylie transcription helps address lexical disparities by offering a consistent transliteration approach, which is crucial for accurate grapheme conversion.
In practical applications, the process of transcribing Tibetan characters into Latin letters involves several steps. Initially, Tibetan text is processed by removing unnecessary spaces and punctuation marks and segmenting them into individual syllables. Each Tibetan letter is subsequently matched to its corresponding Latin letter according to Wylie transcription rules, involving the identification and conversion of root letters, prefixes, superscripts, subscripts, and suffixes. Finally, these transcribed components are combined to generate the complete Wylie transcription string. Notably, the symbol “|” is omitted at the end of sentences, and the symbol “·” is transcribed as a space, simplifying the treatment of special symbols.
Figure 2 displays the Latin letter representations of certain Tibetan characters transcribed via Wylie, visually illustrating these steps and providing a clear and standardized Latin representation of Tibetan sentences while preserving the original letter order.
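To make the preprocessing steps concrete, the following is a minimal, illustrative sketch of the transcription pipeline described above. The tiny WYLIE_MAP lookup table covers only a handful of letters and omits the inherent-vowel and full prefix/superscript/subscript/suffix rules, so it is a simplified stand-in rather than the transcription scheme actually used in the paper.

```python
# Minimal sketch of the Tibetan-to-Wylie preprocessing described above.
# WYLIE_MAP covers only a few letters for illustration; a complete system needs the
# full Wylie inventory plus inherent-vowel and letter-stack handling.
WYLIE_MAP = {
    "ཀ": "k", "ཁ": "kh", "ག": "g", "ང": "ng",
    "བ": "b", "ཤ": "sh", "ས": "s", "ི": "i", "ྲ": "r",
}
TSHEG = "་"   # syllable delimiter, transcribed as a space
SHAD = "།"    # sentence-final mark, omitted in the transcription

def tibetan_to_wylie(text: str) -> str:
    text = text.replace(" ", "").rstrip(SHAD)   # remove spaces, drop trailing sentence mark
    syllables = text.split(TSHEG)               # segment into syllables at the tsheg
    out = []
    for syl in syllables:
        # map each Tibetan letter to its Latin counterpart (default: keep as-is)
        out.append("".join(WYLIE_MAP.get(ch, ch) for ch in syl))
    return " ".join(s for s in out if s)        # the tsheg becomes a space

print(tibetan_to_wylie("བཀྲ་ཤིས།"))             # -> "bkr shis" with this toy map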
2.2. Mixture Alignment Encoder
Compared to the encoder in FastSpeech2, the mixture alignment encoder incorporates an additional Tibetan character encoder. The architecture of the mixture alignment encoder is depicted in Figure 3a, where CP represents the pooling operation at the Tibetan character level and LR represents the length regulator introduced in FastSpeech [11]. A sequence of Latin letters annotated with Tibetan character boundaries is passed as input to the encoder. The Latin letter encoder transforms this sequence into Latin-letter hidden states; character pooling then averages the Latin-letter hidden states within each character, according to the character boundary information, to obtain the input to the character encoder. The character encoder converts these pooled states into character-level hidden states, and a length regulator expands the character-level hidden states according to character-level durations so that they match the length of the target Mel spectrogram. Lastly, the Latin-letter hidden states and the expanded character-level hidden states are processed with character-level relative position encoding and forwarded to the character-to-letter attention module.
In the mixture alignment model, the Latin-letter hidden representations are extended through the character-level duration predictor and the character-to-letter attention mechanism within the mixture alignment encoder. The duration of each Tibetan character is obtained by summing the durations of its Latin letters, ensuring precise alignment at the character level. The character-to-letter attention mechanism then computes an attention weight for each Latin letter’s hidden representation within a character, achieving soft alignment at the Latin letter level. The advantage of the mixture alignment mechanism is that hard alignment at the character level incurs fewer errors than alignment at the Latin letter level, while soft alignment at the Latin letter level captures more fine-grained information.
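As an illustration of the character pooling (CP) step, the sketch below averages the Latin-letter hidden states belonging to each Tibetan character, using a per-letter character index as the boundary information. The tensor shapes and the use of index_add_ are our own assumptions about one reasonable implementation, not the paper’s exact code.

```python
import torch

def character_pool(letter_hidden: torch.Tensor, char_ids: torch.Tensor, num_chars: int) -> torch.Tensor:
    """Average Latin-letter hidden states within each Tibetan character.

    letter_hidden: (num_letters, hidden_dim) hidden states from the letter encoder.
    char_ids:      (num_letters,) index of the character each letter belongs to.
    Returns:       (num_chars, hidden_dim) inputs to the character encoder.
    """
    hidden_dim = letter_hidden.size(1)
    sums = torch.zeros(num_chars, hidden_dim).index_add_(0, char_ids, letter_hidden)
    counts = torch.zeros(num_chars).index_add_(0, char_ids, torch.ones_like(char_ids, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

# Example: five letters grouped into two characters according to boundary information
letters = torch.randn(5, 8)
ids = torch.tensor([0, 0, 0, 0, 1])
print(character_pool(letters, ids, num_chars=2).shape)   # torch.Size([2, 8])
```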
2.2.1. Encoder Architecture
In this paper, both the letter encoder and the character encoder share an identical architecture. Each encoder block includes a multi-head self-attention mechanism, residual connections, normalization, and a one-dimensional convolutional neural network (1D CNN). The multi-head attention mechanism enables enhanced extraction of cross-positional information. Residual connections are employed to address the issues of gradient vanishing and explosion. Normalization adjusts the mean and variance of the input values of each layer to achieve uniformity, thereby enhancing the convergence efficiency. The specific model structure is shown in Figure 3b.
To establish closer connections between adjacent Latin letters, a one-dimensional convolutional neural network (1D CNN) with ReLU activation is employed instead of fully connected networks. A 1D CNN is particularly effective for processing sequential data, such as text sequences.
For example, in Tibetan speech synthesis, consider the Latin representation “bkra shis”, which is the result of transliterating Tibetan text into Latin letters. A 1D CNN applies convolutional filters to this sequence of Latin letters. Using a filter of length three, the network slides over the sequence and processes overlapping windows such as “bkr”, “kra”, “ra ”, and so on. This convolutional operation enables the model to capture local dependencies and patterns within the sequence, such as specific combinations of letters or syllables. This ability to learn from adjacent letters and syllables is essential for generating accurate and natural Tibetan speech.
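The sketch below shows this kind of 1D convolution applied to an embedded letter sequence; the embedding size, channel count, and kernel size of three are illustrative choices rather than the paper’s exact hyperparameters.

```python
import torch
import torch.nn as nn

letters = "bkra shis"                                  # Wylie transliteration of a Tibetan phrase
vocab = {ch: i for i, ch in enumerate(sorted(set(letters)))}
ids = torch.tensor([[vocab[ch] for ch in letters]])    # (batch=1, seq_len=9)

embed = nn.Embedding(len(vocab), 16)                   # letter embeddings (size is illustrative)
conv = nn.Conv1d(16, 16, kernel_size=3, padding=1)     # window of three adjacent letters

x = embed(ids).transpose(1, 2)                         # (1, 16, 9): Conv1d expects (N, C, L)
y = torch.relu(conv(x))                                # each output frame mixes "bkr", "kra", ...
print(y.shape)                                         # torch.Size([1, 16, 9])
```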
2.2.2. Predictor Architecture
The duration predictor is composed of a two-layer 1D convolutional architecture, with a ReLU activation layer, a normalization layer, and a dropout layer following each 1D convolutional layer. An additional linear layer then transforms the hidden features into one scalar per character, indicating the number of frames over which each character is pronounced. The input to the duration predictor is the hidden features from the character encoder, and the output is the actual pronunciation duration of each character. The structure of the duration predictor is shown in Figure 3c.
In addition to the duration predictor, this paper introduces pitch and energy predictors structured similarly to the duration predictor described above. The primary goal of the pitch predictor is to accurately forecast contour changes in the fundamental frequency (F0) of speech signals, which effectively convey different emotional states. F0 is extracted from the raw audio waveform using a hop size identical to that of the target Mel spectrogram, ensuring that the F0 time frames align with the Mel spectrogram frames. F0 is first computed at each time frame and then quantized into 256 possible values representing different pitch levels; the quantized F0 is subsequently transformed into a sequence of one-hot vectors for further processing.
The energy predictor determines the volume of the sound signal by computing the L2 norm of the short-time Fourier transform (STFT) of each frame. This converts each frame of the sound signal into a numerical value reflecting its intensity or loudness, which is then quantized into 256 values.
During the training phase, pitch and energy information calculated directly from the original speech are utilized. The predictors’ outputs are optimized by comparing their mean squared errors with real F0 and energy values. In the inference phase, trained pitch and energy predictors are used to predict F0 and energy for each time frame.
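A simplified sketch of the frame-level pitch and energy extraction described above is given below. The pYIN pitch range, the linear quantization grid, and the STFT settings are our own assumptions; the paper’s exact extraction parameters may differ.

```python
import numpy as np
import librosa

def pitch_energy_features(wav: np.ndarray, sr: int = 16000, hop: int = 256, n_bins: int = 256):
    """Frame-level F0 and energy, each quantized into n_bins values (simplified sketch)."""
    # F0 per frame via pYIN, using the same hop size as the target Mel spectrogram
    f0, _, _ = librosa.pyin(wav, fmin=65.0, fmax=600.0, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)                                             # unvoiced frames -> 0
    f0_bins = np.digitize(f0, np.linspace(0.0, 600.0, n_bins - 1))     # 256 pitch levels

    # Energy per frame: L2 norm of the STFT magnitude
    stft = np.abs(librosa.stft(wav, n_fft=1024, hop_length=hop))
    energy = np.linalg.norm(stft, axis=0)
    energy_bins = np.digitize(energy, np.linspace(0.0, energy.max() + 1e-6, n_bins - 1))
    return f0_bins, energy_bins
```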
The introduction of these predictors enables our end-to-end system to capture pitch and energy variations within speech signals more accurately, thereby enhancing the naturalness and expressiveness of speech synthesis. Future research will continue to refine these predictors, exploring more effective feature extraction and model training strategies to further improve the quality and performance of speech synthesis.
2.2.3. Length Regulator
Suppose there is a sequence of character-level hidden representations $\mathcal{H} = [h_1, h_2, \ldots, h_n]$ and a sequence of character duration frames $\mathcal{D} = [d_1, d_2, \ldots, d_n]$, which satisfies $\sum_{i=1}^{n} d_i = m$, where $m$ represents the spectrum length, $LR$ represents the length regulator, and $\mathcal{D}$ represents the character duration frames. The hidden feature for spectrum generation, $\mathcal{H}_{mel}$, is derived from the following equation:
$$\mathcal{H}_{mel} = LR(\mathcal{H}, \mathcal{D}, \alpha)$$
where the parameter $\alpha$ determines the length of the extended sequence, thereby regulating the rate of speech. In the inference stage, this paper uses the durations generated by the duration predictor as the input.
To illustrate this process more clearly, consider the example shown in Figure 4. The figure demonstrates the transformation from $\mathcal{H}$ to $\mathcal{H}_{mel}$ using the length regulator and the character duration frames $\mathcal{D}$. In this example, given $\mathcal{H}$ and $\mathcal{D}$, the length regulator adjusts the sequence by repeating each character hidden representation according to the corresponding duration, resulting in the final sequence $\mathcal{H}_{mel}$.
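A minimal implementation of the length regulator is sketched below: it simply repeats each character-level hidden vector according to its (possibly speed-scaled) duration, which is the behaviour described above. The function name and rounding choice are our own.

```python
import torch

def length_regulator(char_hidden: torch.Tensor, durations: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Expand character-level hidden states to the Mel-frame level.

    char_hidden: (num_chars, hidden_dim)
    durations:   (num_chars,) frames per character (ground truth in training,
                 duration-predictor output at inference)
    alpha:       speed control; alpha > 1 lengthens durations (slower speech)
    """
    scaled = torch.round(durations.float() * alpha).long()
    return torch.repeat_interleave(char_hidden, scaled, dim=0)   # (sum(scaled), hidden_dim)

h = torch.randn(3, 4)                    # three characters
d = torch.tensor([2, 1, 3])              # their durations in frames
print(length_regulator(h, d).shape)      # torch.Size([6, 4])
```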
2.2.4. Character-to-Letter Attention Module
This study introduces a character-to-letter attention module that extends the hidden features at the Latin letter level. Specifically, the expanded character-level hidden states $\mathcal{H}_{mel}$ serve as the query $Q$, with the Latin-letter hidden states functioning as both the key $K$ and the value $V$. The formula for calculating the attention weights is as follows:
$$\alpha = \mathrm{softmax}\!\left(\frac{Q\,(K W^{K})^{\top}}{\sqrt{d}}\right)$$
where $W^{K}$ represents the learnable transformation matrix, and $\alpha_i$ denotes the attention weight on the hidden feature of the $i$-th Latin letter. Then, using the computed attention weights $\alpha$, the output for each query vector is calculated as follows:
$$O = \alpha\,(V W^{V})$$
where $W^{V}$ represents another learnable transformation matrix. In this way, this study obtains a Latin-letter representation $O$ matching the spectrum length $m$. To ensure that each query vector only accesses key–value pairs within the same character, a mask matrix based on the character boundary information from the input text encoder is constructed and incorporated into the attention weight matrix $\alpha$. Additionally, to account for the monotonic alignment of Latin letters with speech, positional encoding information is added to $Q$, $K$, and $V$. For $K$ and $V$, this positional encoding information is $\frac{i}{N_c}\,w$, where $i$ represents the position of each Latin letter within its corresponding character, $N_c$ represents the total number of Latin letters in character $c$, and $w$ is a learnable vector embedding. For $Q$, the positional encoding information is $\frac{j}{T_c}\,w'$, where $j$ represents the frame’s position within the character, $T_c$ represents the total number of frames in the character, and $w'$ is another learnable vector embedding.
To illustrate this process, Figure 5 shows the calculation principle of character-to-letter attention. The figure helps to visualize how the attention weights $\alpha$ are computed and applied to obtain the final output $O$.
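The sketch below illustrates the masked scaled dot-product attention used here: the expanded character-level states attend only to the Latin-letter states inside the same character, enforced by a boundary mask. For brevity it omits the learnable projections $W^{K}$, $W^{V}$ and the relative positional terms, so it is a simplified assumption rather than the full module.

```python
import torch
import torch.nn.functional as F

def char_to_letter_attention(q, k, v, frame_char_ids, letter_char_ids):
    """q: (T, d) expanded character states (one per Mel frame),
    k, v: (L, d) Latin-letter states,
    frame_char_ids: (T,) character index of each frame,
    letter_char_ids: (L,) character index of each letter."""
    d = q.size(-1)
    scores = q @ k.transpose(0, 1) / d ** 0.5              # (T, L)
    # Mask: a frame may only attend to letters of its own character
    mask = frame_char_ids.unsqueeze(1) != letter_char_ids.unsqueeze(0)
    scores = scores.masked_fill(mask, float("-inf"))
    alpha = F.softmax(scores, dim=-1)                      # attention weights
    return alpha @ v                                       # (T, d) letter-level output

q = torch.randn(6, 8)                                      # 6 Mel frames
k = v = torch.randn(5, 8)                                  # 5 Latin letters
out = char_to_letter_attention(q, k, v,
                               frame_char_ids=torch.tensor([0, 0, 0, 1, 1, 1]),
                               letter_char_ids=torch.tensor([0, 0, 0, 1, 1]))
print(out.shape)                                           # torch.Size([6, 8])
```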
2.3. Decoder Architecture
The decoder architecture is the same as the character encoder architecture. First, the decoder receives vectors encoded with positional information by the mixture alignment encoder. These vectors contain positional and contextual information of the input data. Next, the decoder passes these vectors to the Mel linear module. The role of this module is to linearly project the hidden states into the Mel spectrogram, thereby generating the corresponding speech features. The Mel spectrogram is an important feature representation in speech synthesis. It mimics the human ear’s perception by applying a nonlinear transformation to the speech spectrum. Finally, the Mel spectrogram output by the decoder is used to generate high-quality speech signals.
2.4. HiFi-GAN Vocoder
To optimize both synthesis speed and quality, this study utilizes the HiFi-GAN [23] vocoder to convert the speech spectrum into waveform audio more efficiently and with enhanced fidelity. The HiFi-GAN vocoder comprises a generator and two discriminators. The generator is a convolutional neural network that takes the Mel spectrogram as input and repeatedly up-samples it until the output sequence matches the time-domain resolution of the original waveform, producing outputs analogous to real speech. The two discriminators assess the authenticity of the synthesized speech during adversarial training. Upon completion of training, the generator is employed directly during synthesis to produce the speech waveform signal.
2.5. Pre-Training Method
Currently, the amount of available Tibetan language corpus is relatively limited, and relying solely on parallel Tibetan corpora for system training yields suboptimal results. Inspired by transfer learning, this paper proposes a pre-training method to address this issue.
In the context of deep neural network-based transfer learning methods, fine-tuning is one of the simplest and most convenient approaches. Fine-tuning involves pre-training a model on one task and then adapting it to a related task by training it on new data specific to the target domain. This method significantly reduces training time and leverages features already learned by the pre-trained model, thereby enhancing performance on the new task. While fine-tuning does increase some training costs, its significant advantage lies in accelerating the training process, reducing the amount of data required and improving the final model’s performance. These benefits typically outweigh the additional costs. For resource-scarce languages like Tibetan, this pre-training method is particularly valuable. Compared to English and Chinese, Tibetan has very limited linguistic resources. By pre-training a model on a more resource-rich language and then fine-tuning it with Tibetan data, we can effectively utilize the extensive resources available for other languages, thereby improving the performance of Tibetan speech synthesis systems. This approach not only compensates for the lack of Tibetan data but also ensures the system performs well even with limited data.
Specifically, we first train an English speech synthesis system using “text-to-audio” data from a large English dataset. For this purpose, we used the publicly available LJSpeech dataset [24], which contains high-quality text–audio pairs that have been carefully selected and processed to ensure suitability for pre-training. In the initial training phase, we train the FastSpeech2 and mixture alignment FastSpeech2 models using the LJSpeech dataset. This process includes initializing the model with predefined parameters and conducting several training iterations to gradually learn the mapping between English text and the corresponding speech. During training, we use the mean absolute error (MAE) and mean squared error (MSE) loss functions to optimize the model parameters. To prevent overfitting, we periodically evaluate the model’s performance on a validation set.
After completing the initial training with the English dataset, we proceed with transfer learning. In this phase, we keep certain encoder and decoder weights fixed, particularly those layers that have effectively captured speech features. We then retrain the model using “text-to-audio” data in the target language to adapt the model to the new language characteristics. By adjusting the weights of the unfrozen layers, the model gradually learns the features of the target language.
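In PyTorch terms, the freezing step can be sketched as follows; the module layout, the choice of which layers to freeze, and the checkpoint path are placeholders standing in for whatever the actual implementation uses.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the acoustic model; the real module names will differ.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 256)),
    "decoder": nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 80)),
})

# 1) Load the English-pretrained weights (path is illustrative).
# model.load_state_dict(torch.load("ljspeech_pretrained.pt"), strict=False)

# 2) Freeze the layers that already capture generic speech/text features;
#    exactly which layers stay fixed is a design choice.
for name, param in model.named_parameters():
    if name.startswith(("encoder.0", "decoder.0")):
        param.requires_grad = False

# 3) Fine-tune only the remaining parameters on Tibetan text-audio pairs.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```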
This transfer learning method not only accelerates the decoder’s learning of speech feature information but also swiftly establishes correspondence between phonetic and textual units in the target language, akin to those in the source language. Furthermore, it effectively leverages encoder information to expedite the establishment of the language model and the extraction of text information.
3. Experiments and Results
3.1. Datasets
In this study, we used two datasets to train and pre-train the model: the Tibetan Lhasa dialect dataset and the LJSpeech dataset [24].
We utilized the Tibetan multi-dialect speech dataset from the Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE at Minzu University of China to train the model. In this experiment, we employed the 16 kHz, 16-bit mono Tibetan Lhasa dialect dataset. This dataset, which includes recordings from four different speakers, has a total duration of approximately 7.47 h and contains 6452 speech samples, with an average duration of approximately 4 s per sentence.
To further improve the model’s performance, we also used the LJSpeech dataset for pre-training. LJSpeech is a widely used, open, single-speaker speech dataset, featuring recordings of a female speaker. This dataset has a total duration of approximately 24 h, containing 13,100 speech samples, with an average duration of 6.57 s per sentence. The audio specifications are 22.05 kHz, 16-bit mono.
For consistency, we downsampled the LJSpeech audio to 16 kHz to match the sampling rate of the Tibetan Lhasa dialect dataset. This ensures uniformity in audio processing across both pre-training and training stages.
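The downsampling step can be performed with standard tools; a minimal torchaudio-based sketch is shown below (file names are illustrative).

```python
import torchaudio

# Downsample an LJSpeech clip from 22.05 kHz to 16 kHz to match the Tibetan data.
waveform, sr = torchaudio.load("LJ001-0001.wav")            # sr == 22050 for LJSpeech
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
waveform_16k = resampler(waveform)
torchaudio.save("LJ001-0001_16k.wav", waveform_16k, sample_rate=16000)
```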
In the training process, we used two GPUs for parallel processing, with the batch size set to 32. The output of the acoustic model is an eighty-dimensional Mel spectrogram, which is converted into audio samples using the pre-trained HiFi-GAN vocoder. The generated audio signals have a sampling frequency of 16 kHz.
3.2. Parameter Setup
In deep learning neural networks, both hyperparameters and network architecture are crucial factors. Hyperparameters are parameters that must be set manually, and they significantly influence the model’s performance and convergence speed. Improper choices can lead to issues such as failure to converge or overfitting. Therefore, selecting appropriate hyperparameters is paramount. The hyperparameters and network architecture settings used in this paper are shown in Table 1.
3.3. Feature Prediction
The Mel spectrogram visualizes the effects of synthesized speech, and the more detailed the Mel spectrogram, the higher the potential quality of the synthesized audio. The Mel spectrograms synthesized by each model are depicted in Figure 6. We observed that the mixture alignment FastSpeech2 model has an advantage in predicting Mel spectrogram details.
In the attention alignment graph, dots represent attention weight values; the higher the value, the brighter the dot. The alignment graph reflects the stability of the speech synthesis model: the clearer the line in the graph, the higher the accuracy of speech synthesis and the more stable the model. The alignment graph generated by the mixture alignment FastSpeech2 model is depicted in Figure 7; its clear and bright lines indicate good alignment capability.
3.4. Model Loss
In this section, we analyze the training losses for the mixture alignment FastSpeech2 model, focusing on duration, energy, and pitch losses. The overall model loss, which includes these components, is also discussed.
The losses for duration, energy, and pitch are calculated using the mean squared error (MSE) function:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ represents the true value, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples.
3.4.1. Duration Loss
Duration loss measures the error in predicting the duration of characters:
$$\mathcal{L}_{duration} = \mathrm{MSE}(\hat{D}, D)$$
where $\hat{D}$ is the predicted duration, and $D$ is the ground-truth duration. Duration loss converges quickly and typically remains lower than the energy and pitch losses.
3.4.2. Energy Loss
Energy loss quantifies the discrepancy in predicted energy levels:
$$\mathcal{L}_{energy} = \mathrm{MSE}(\hat{E}, E)$$
where $\hat{E}$ is the predicted energy, and $E$ is the ground-truth energy. Energy loss generally follows a trend similar to pitch loss.
3.4.3. Pitch Loss
Pitch loss assesses the error in predicting pitch values:
$$\mathcal{L}_{pitch} = \mathrm{MSE}(\hat{P}, P)$$
where $\hat{P}$ is the predicted pitch, and $P$ is the ground-truth pitch. Compared to the duration and energy losses, pitch loss exhibits significant oscillations.
3.4.4. Overall Model Loss
The overall model loss decreases as training progresses, indicating improved model performance. As training advances, the individual component losses stabilize, and the reduction in overall loss reflects the model’s enhanced ability to synthesize speech accurately.
The trends in the training losses are shown in Figure 8, which illustrates the changes in duration loss, energy loss, pitch loss, and overall model loss during training.
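For reference, under the standard FastSpeech2-style formulation the overall objective is the sum of the Mel-spectrogram reconstruction loss and the three variance losses. The unweighted sum below is an assumption, as the paper does not state explicit loss weights:
$$\mathcal{L}_{total} = \mathcal{L}_{mel} + \mathcal{L}_{duration} + \mathcal{L}_{pitch} + \mathcal{L}_{energy}$$
where $\mathcal{L}_{mel}$ is the mean absolute error between the predicted and ground-truth Mel spectrograms, consistent with the MAE and MSE losses mentioned in Section 2.5.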
3.5. Evaluation Indicators and Experimental Results
We employed both subjective and objective methods to evaluate the synthesized speech.
3.5.1. Subjective Evaluation Indicators
The mean opinion score (MOS) is a commonly used subjective assessment method. The MOS assessment criteria are shown in Table 2.
For the subjective evaluation, five Tibetan university students were invited, and ten Tibetan texts were randomly selected. The listeners were asked to score the speech synthesized from the same texts by the different models, using the original recordings as a reference. Finally, the average score across all listeners was taken as the final result.
3.5.2. Subjective Evaluation Results
Subjectively, ten Tibetan texts were randomly selected to compare the speech synthesis performance of FastSpeech2 and mixture alignment FastSpeech2; the experiments also included models pre-trained on the LJSpeech dataset. Five Tibetan participants then rated the speech synthesized by the different models. The MOS was calculated, and the results are displayed in Table 3. In the first row of the table, “Ground Truth” refers to the original recordings of the selected Tibetan texts, used as a reference for evaluating the quality of the synthesized speech.
To assess the effectiveness of the speech synthesis models, we first conducted an analysis of variance (ANOVA) to determine if there were any statistically significant differences among the mean opinion scores (MOS) of the various models. The ANOVA results revealed a significant overall difference (F-value = 11.90, p < 0.001), indicating that at least one of the models differed significantly from the others.
To further investigate which specific pairs of models exhibited significant differences, we performed a post-hoc Tukey HSD test. The results of this test are summarized in Table 4, which shows the adjusted p-values for comparisons between each model and the mixture alignment FastSpeech2 + pre-trained model. The test revealed that the mixture alignment FastSpeech2 + pre-trained model showed a significant improvement over both FastSpeech2 (p = 0.0065) and the FastSpeech2 + pre-trained model (p = 0.0065), confirming its superior performance.
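This statistical comparison can be reproduced with standard tools, as sketched below; the MOS arrays are random placeholders rather than the actual ratings.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder MOS ratings per model (5 raters x 10 texts, flattened); real scores differ.
rng = np.random.default_rng(0)
mos = {
    "FastSpeech2":                        rng.normal(3.7, 0.3, 50),
    "FastSpeech2+pretrained":             rng.normal(3.8, 0.3, 50),
    "MixtureAlignFastSpeech2":            rng.normal(3.9, 0.3, 50),
    "MixtureAlignFastSpeech2+pretrained": rng.normal(4.0, 0.3, 50),
}

# One-way ANOVA across models
f_val, p_val = f_oneway(*mos.values())
print(f"ANOVA: F = {f_val:.2f}, p = {p_val:.4f}")

# Post-hoc Tukey HSD for pairwise comparisons
scores = np.concatenate(list(mos.values()))
labels = np.repeat(list(mos.keys()), [len(v) for v in mos.values()])
print(pairwise_tukeyhsd(scores, labels))
```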
3.5.3. Objective Assessment Method
In the objective evaluation, this paper uses two key metrics: the root mean square error (RMSE) and the real-time factor (RTF) of the speech synthesis process.
Root Mean Square Error (RMSE)
RMSE measures the differences between the original and synthesized speech in the time domain. It quantifies the average magnitude of the errors between predicted and observed values. The calculation method is depicted in Equation (9):
$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}$$
where $y_t$ and $\hat{y}_t$ represent the original and synthesized speech sequences at moment $t$. The smaller the RMSE value, the greater the similarity between the original and synthesized speech, and the higher the quality of the synthesized speech.
Real-Time Factor (RTF)
The real-time factor (RTF) measures the efficiency of the speech synthesis process. It represents the ratio of the time required to process a given duration of audio to the actual duration of that audio. The RTF is calculated using the following equation:
$$\mathrm{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}}$$
where “Processing Time” is the total time taken by the system, including all components, to process the audio, and “Audio Duration” is the length of the audio being processed. A smaller RTF value indicates more efficient synthesis, meaning the system processes speech faster and is more suitable for real-time applications.
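Both metrics are straightforward to compute; the small sketch below uses placeholder arrays and timings rather than the paper’s data.

```python
import time
import numpy as np

def rmse(reference: np.ndarray, synthesized: np.ndarray) -> float:
    """Time-domain RMSE between original and synthesized waveforms."""
    n = min(len(reference), len(synthesized))
    return float(np.sqrt(np.mean((reference[:n] - synthesized[:n]) ** 2)))

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds

# Usage sketch with placeholder data (4 s of audio at 16 kHz)
ref = np.random.randn(16000 * 4)
start = time.time()
syn = ref + 0.01 * np.random.randn(len(ref))      # stand-in for a model's output
elapsed = time.time() - start
print(rmse(ref, syn), real_time_factor(elapsed, len(ref) / 16000))
```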
3.5.4. Objective Evaluation Results
The evaluation results for the objective metrics are presented in the following tables.
RMSE Results
Table 5 shows the RMSE results for the different models. A smaller RMSE value indicates better performance in synthesizing speech that closely matches the original speech.
RTF Results
Table 6 presents the RTF results for the various models. A smaller RTF value indicates more efficient synthesis, with the model requiring less time to synthesize audio.
As indicated in Table 5 and Table 6, the RMSE values for the mixture alignment FastSpeech2 model decreased, reflecting improved accuracy in speech synthesis. The utilization of the model pre-trained on the English dataset resulted in a smaller RMSE compared to models without pre-training, enhancing the quality of synthesized speech. Additionally, the RTF values demonstrate that the mixture alignment FastSpeech2 + pre-trained model processes speech more efficiently, achieving the lowest RTF value of 0.186. This indicates that this model is not only more accurate but also more efficient in real-time speech synthesis applications.
Training Time and Inference Speed
Table 7 compares the training times and inference speeds of different models. This comparison highlights the practical aspects of model deployment and efficiency.
When comparing the Transformer TTS and mixture alignment FastSpeech2 models, the latter demonstrates significant advantages in both training time and inference speed, as detailed in Table 7. Mixture alignment FastSpeech2 requires only 45 h for training, markedly less than the 122 h needed by Transformer TTS. This reduction is largely due to the autoregressive nature of Transformer TTS, which necessitates a longer training period to capture detailed speech data, whereas mixture alignment FastSpeech2 employs a more efficient alignment mechanism. In terms of inference speed, mixture alignment FastSpeech2 outperforms Transformer TTS with a real-time factor of 0.193 compared to 0.984, highlighting its superior efficiency. Although mixture alignment FastSpeech2 has a slightly longer training time than FastSpeech2 (37 h), owing to the additional computations required by its mixture alignment mechanism, the resulting gains in alignment accuracy and inference efficiency make mixture alignment FastSpeech2 a more practical and efficient choice for real-world applications.
Both the subjective and objective evaluation results indicate that using a pre-trained model leads to higher quality and naturalness in the synthesized speech. We speculate that the reasons for this improvement are as follows: first, the pre-trained model learns rich feature representations from the LJSpeech dataset and transfers these learned representations to Tibetan speech synthesis; second, the pre-trained model absorbs relevant audio and acoustic characteristics of speech, improving its ability to capture the underlying structure of the spoken content; finally, the pre-trained model generalizes well, enabling it to handle unseen data and complex scenarios.
3.6. Ablation Studies
To understand the contributions of different components in our mixture alignment FastSpeech2 model, we conducted ablation studies. Specifically, we evaluated the impact of the pitch predictor, the energy predictor, and the mixture alignment mechanism on overall performance, using comparative mean opinion score (CMOS) evaluations. The results are shown in Table 8. We found that removing the energy predictor from mixture alignment FastSpeech2 results in a drop in voice quality, indicating that the energy predictor is effective in improving voice quality. Removing the pitch predictor results in a CMOS decrease of 0.11, demonstrating the effectiveness of the pitch predictor. When both the pitch and energy predictors are removed, voice quality drops even further, indicating that both predictors are crucial to the performance of mixture alignment FastSpeech2.
To validate the effectiveness of the mixture alignment mechanism, we compared mixture alignment FastSpeech2 (with mixture alignment) against FastSpeech2 (with hard alignment). The experimental results are shown in row 5 of Table 8 and indicate that mixture alignment FastSpeech2 performed better, confirming the effectiveness of the mixture alignment mechanism.
4. Conclusions
This study presented an end-to-end model incorporating a mixture alignment mechanism based on FastSpeech2 for Tibetan speech synthesis. Firstly, the Tibetan text underwent preprocessing, with the characters transcribed into Latin letters following the Wylie transcription rules. Then, the mixture alignment FastSpeech2 model was proposed to mitigate errors arising from the rigid alignment of Latin letters. Additionally, to bolster the model’s generalization capabilities, the synthesis model underwent training using an English dataset before transferring the pre-trained model to Tibetan speech synthesis.
Experimental results demonstrate significant improvements in synthesized speech quality achieved by the mixture alignment FastSpeech2 model. Although oscillations were observed while learning the pitch and energy features, these observations point to directions for further optimization and improvement. Future efforts will focus on integrating features such as pitch, energy, and speech styles to enhance the naturalness and expressiveness of synthesized speech.
Future research will explore personalized speech synthesis by integrating speaker information into models to customize generated speech based on individual characteristics and styles. Personalized speech synthesis not only tailors speech generation to speaker attributes but also enhances the user’s experience by meeting specific needs.
With advancements in artificial intelligence and deep learning technologies, large models demonstrate immense potential in speech synthesis. Models like GPT-4, T5, and BERT possess strong learning and generalization capabilities, better capturing and expressing language complexity. Future applications aim to further improve Tibetan speech synthesis quality and effectiveness by leveraging these large models. By incorporating multilingual training and cross-lingual learning, these models will utilize diverse data resources from other languages to further enhance Tibetan speech synthesis.
Addressing technical challenges will involve exploring data augmentation techniques, model optimization strategies, and enhancing computational efficiency. These efforts will elevate Tibetan speech synthesis technical standards and broaden its applications in education, cultural dissemination, and intelligent assistants. In conclusion, large models show promising application prospects in Tibetan speech synthesis. Continuous innovation and optimization aim to contribute possibilities and opportunities for Tibetan language development and dissemination.