1. Introduction
Automatic Speech Recognition (ASR) is a technology that allows humans to interact with machine interfaces via voice. Over the past few decades, speech recognition technology has developed rapidly and is now applied in many fields, such as banking and healthcare. However, current state-of-the-art (SOTA) models all depend on large corpora of high-resource languages. Of the world's 6000+ languages, fewer than 1% have speech corpora sufficient to support the development of speech technology [1]. This has led to the unbalanced development of speech technology today: many minority languages do not even have a complete speech database, let alone an ASR system of their own.
The Tu nationality is one of the smaller ethnic groups in China, distributed across the eastern part of Qinghai Province and the western part of Gansu Province [2]. Since the language has no standard written form, the Tu dialect is inherited and developed mainly by word of mouth. To date, research on the Tu language has addressed only its vocabulary and pronunciation [3,4,5], and no complete, standardized Tu corpus exists. In this paper, we establish a new Tu-speech corpus of Huzhu County (HZ-TuDs) and apply recent deep learning methods to end-to-end (E2E) ASR of the Tu language. All models were trained from scratch on our corpus. By improving the base model, we greatly increase recognition accuracy, and we address the challenge of dialect inheritance in Tu-language speech recognition by building a deep neural model for the Tu dialect [6]. This study, therefore, aimed to carry out the following:
Create a new dataset for the Tu language to provide data and experimental support for subsequent Tu dialect research.
Design a Tu–ASR system to facilitate communication between elderly people who do not speak Mandarin.
Protect the culture of the Tu nationality and inherit the language and the civilization of the Tu nationality.
2. Datasets
The Tu language, also known as the Monguor language, belongs to the Mongolic branch of the Altaic language family and is divided into three distinct dialect areas: Huzhu, Minhe, and Tongren. The dialects differ in pronunciation, prosody, lexicon, morphology, and syntax. Owing to the lack of standard pronunciation rules and of a standardized written form, the pronunciation gap between regions is very large. Speakers of the Tu language are aging, and the language faces a crisis of lost heritage. For these reasons, we focus on the daily dialogue of the Tu language in Huzhu County.
During our investigation, people from two different villages in Huzhu County sometimes could not understand parts of each other's dialect. To resolve these pronunciation differences, we first compiled the text material from Lu Wenzhong's printed version of Daily Language of the Tu Nationality, which contains 308 common words and expressions of the Tu nationality. We then cooperated with residents of three different towns in Huzhu County to further correct and revise the material, obtaining Daily Expressions of the Tu Nationality in Huzhu County (HZ-Tu Daily Document). A total of 288 words and sentences remained after pruning. Following Lu Wenzhong's example, each sentence was annotated with Chinese characters; this document served as both the pronunciation standard of the Tu language of Huzhu County and the collection standard of the HZ-TuDs corpus.
Figure 1 presents a selection of sentences for illustration purposes.
2.1. Pronunciation Rules
Even within Huzhu County, local residents divide the dialect into three areas: Wushi, Danma, and Dongshan towns, whose pronunciations are not identical. We therefore selected residents who had long lived in these three towns, to minimize the impact of accent crossover from the family environment, and had them discuss the phonetic transcription of the original document, the printed version of Daily Language of the Tu Nationality. Sentences pronounced differently across the three towns, as shown in Figure 2, were deleted. Finally, we established the unified pronunciation standard of the Tu language of Huzhu County, recorded in the "HZ-Tu Daily Document".
This file was also used as the transcribed text of subsequent dialect speech recognition.
Table 1 shows some specific information of the file.
2.2. Data Acquisition
The speakers of HZ-TuDs1 are all proficient in daily communication in the Tu language; they include Tu-nationality university students in Huzhu County, teachers in higher vocational colleges, indigenous media broadcasters, enterprise staff, sales staff, doctors, and local language researchers, aged 18 to 55. In addition, the immediate family members of all speakers must also have lived in Huzhu County: because the Tu language is passed on orally and has no writing, the living environment strongly shapes its pronunciation. These steps ensured that the recordings carry the Huzhu County accent and no other regional accents. Speakers read in strict accordance with the pronunciation marked in the HZ-Tu Daily Document, ensuring uniform, standardized speech data for the Tu language. Each recording was sampled with the professional recording software JZSoundRecorder (v3.6.2). The sampling criteria are shown in
Table 2.
The naming format of the audio file was the abbreviation of the speaker’s name plus the serial number, in which the serial number is consistent with the serial number in the HZ-Tu Daily Document, so as to ensure that the corresponding Chinese meaning of the Tu language can be indexed through the serial number.
Figure 3 displays the format of the audio files. In total, we obtained 8190 s (2.2 h) of data.
Because few people can fluently speak the standard Huzhu sentences, we used the open-source GPT-SoVITS model [7] to clone the speakers' timbre, expanding the HZ-TuDs1 corpus and increasing its diversity. We selected one man and one woman from HZ-TuDs1 who best met the pronunciation standard and used their recordings as the reproduction standard, obtaining HZ-TuDs2. The preliminary HZ-TuDs corpus was obtained by merging HZ-TuDs1 and HZ-TuDs2. All speaker information is shown in
Table 3.
2.3. Data Processing
After obtaining HZ-TuDs, we manually screened it to further improve its quality, as follows: (1) delete audio files with strong electrical interference or strong noise; (2) delete obvious mispronunciations; (3) delete unplayable recordings. After these procedures, the final HZ-TuDs dataset was obtained.
Table 4 shows the details of the HZ-TuDs dataset.
3. Methods
This section describes our data preprocessing and several baseline models for speech recognition, and then introduces our proposed new module and model.
3.1. Data Preprocessing
3.1.1. Audio File Processing
We used the librosa.feature.rms() and librosa.power_to_db() functions from the librosa library to calculate the decibel level of the speech segments and compared it against a preset silence-detection threshold to identify silent regions and extract the retained audio. In this study, we set the silence-detection threshold to the default value of −40 dB. Similarly, we applied a noise-suppression factor of 0.5 when denoising the speech signals.
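Under the simplifying assumption of non-overlapping frames (the paper's actual pipeline uses librosa.feature.rms and librosa.power_to_db; the function name trim_silence and the frame length here are illustrative), the thresholding step can be sketched as:

```python
import numpy as np

def trim_silence(y, frame_len=2048, threshold_db=-40.0):
    """Drop non-overlapping frames whose RMS energy is below a dB threshold.

    A simplified NumPy stand-in for the librosa.feature.rms /
    librosa.power_to_db pipeline described in the text.
    """
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Decibels relative to the loudest frame (cf. librosa.power_to_db)
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    kept = frames[db > threshold_db]          # keep only non-silent frames
    return kept.reshape(-1)
```

Frames whose level falls more than 40 dB below the loudest frame are discarded; the retained frames are concatenated back into a single signal.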
3.1.2. Transcript Processing
The TXT transcript was loaded into a DataFrame, a 2D heterogeneous tabular data structure containing the audio path, audio transcript, and audio ID for each file. We then used Keras's Tokenizer (from keras.preprocessing.text import Tokenizer) to build a dictionary over the text corpus with two columns: the first holds each word in the text data, and the second its corresponding ID. Our model uses this dictionary as a lookup table during the decoding phase. This is shown in Table 5.
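The word-to-ID lookup table can be sketched in plain Python (an illustrative stand-in for keras.preprocessing.text.Tokenizer; the transcripts and the helper name build_vocab are hypothetical):

```python
def build_vocab(transcripts):
    """Assign each distinct word an integer ID, starting from 1.

    Mirrors the two-column dictionary built by Keras's Tokenizer;
    0 is conventionally reserved for padding.
    """
    vocab = {}
    for line in transcripts:
        for word in line.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

transcripts = ["ni hao", "hao ma"]               # hypothetical transcripts
vocab = build_vocab(transcripts)                 # word -> ID
id_to_word = {i: w for w, i in vocab.items()}    # inverse table used at decoding
```

At decoding time, the inverse table maps the model's predicted IDs back to words.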
3.1.3. Feature Extraction
The steps of audio feature extraction are shown in Figure 4. We divided the audio signal (sampled at 16,000 Hz) into a series of frames with a frame shift of 10 ms and a frame length of 25 ms, then applied the short-time Fourier transform to each frame. Finally, we obtained the spectrogram through a set of 161 Mel-scale triangular filter banks. The spectrogram of the audio file served as the input: the feature map of each speech frame was extracted and sent to the CNN input layer.
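The framing and STFT steps of Figure 4 can be sketched as follows (a NumPy sketch only; a full pipeline would additionally apply the 161 Mel-scale triangular filters, e.g. via librosa.filters.mel, and the function name stft_magnitude is ours):

```python
import numpy as np

def stft_magnitude(y, sr=16000, frame_ms=25, hop_ms=10):
    """Frame the signal and take a magnitude STFT (first steps of Figure 4)."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz (25 ms)
    hop = int(sr * hop_ms / 1000)           # 160 samples (10 ms frame shift)
    window = np.hanning(frame_len)          # taper each frame before the FFT
    frames = [y[i:i + frame_len] * window
              for i in range(0, len(y) - frame_len + 1, hop)]
    # One magnitude spectrum per frame -> (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))
```

For a 1 s signal at 16 kHz, this yields 98 frames of 201 magnitude bins each, which the Mel filter bank would then compress to 161 channels.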
3.2. Models
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [8]. In this paper, all our methods are based on E2E models. One is the GRU-based DeepSpeech2 model; DeepSpeech2 [9] provides a deep learning-based architecture that gives promising results in English and Mandarin [10]. The other is the conformer, which is regarded as state-of-the-art E2E ASR technology because of its consistently excellent performance over a wide range of ASR tasks [11]. We compared these two representative models and selected the one with higher accuracy as our baseline. We then describe our proposed new module and how it is integrated into the base model.
3.2.1. E2E GRU
The first base model is a GRU-based encoder–decoder seq2seq model [12]. GRU, an upgraded version of LSTM, is widely used in E2E ASR [13]. The model is shown in Figure 5. The encoder consists of GRU layers with batch normalization; the decoder consists mainly of GRU layers with an attention mechanism. The input is the spectrogram, and the output is the predicted character sequence.
3.2.2. SA-Conv Module
Inspired by the success of SimAM [14], we propose the SA-conformer, built around a new SA-Conv module: we combine the SimAM module with the convolution module of the conformer, replacing the original convolution module.
Other attention mechanisms generate 1D or 2D weights and treat all neurons in a channel or spatial location equally, which may limit their ability to learn discriminative cues. SimAM instead refines features with full 3D weights by estimating the importance of individual neurons. Moreover, unlike other attention mechanisms, SimAM is grounded in the well-established spatial suppression theory from neurobiology, and is thus closer to the original biological inspiration of deep learning.
In neuroscience, information-rich active neurons inhibit surrounding neurons, a phenomenon known as spatial suppression; neurons exhibiting spatial suppression are assigned higher importance. The minimal energy of a neuron is given by Equation (1):

$$e_t^{*} = \frac{4(\hat{\sigma}^{2} + \lambda)}{(t - \hat{\mu})^{2} + 2\hat{\sigma}^{2} + 2\lambda} \quad (1)$$

where $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M} (x_i - \hat{\mu})^{2}$. Here, $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$, $i$ is the index over the spatial dimension, $M = H \times W$ is the number of neurons on that channel, and $\lambda$ is a hyperparameter set when coding.
The lower the energy, the more the neuron differs from its surrounding neurons and the higher its importance. The importance of a neuron can therefore be obtained as $1/e_t^{*}$. This is the principle of SimAM's attention mechanism. The core idea of SimAM is based on the local self-similarity of images: in an image, neighboring pixels usually have strong mutual similarity, while the similarity between distant pixels is weak. SimAM exploits this property to generate attention weights by calculating the similarity between each pixel and its neighbors in the feature map.
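As a minimal sketch of the weighting in Equation (1) (a NumPy stand-in; the function name simam and the sigmoid applied to the inverse energy follow the public SimAM reference implementation, with lam playing the role of $\lambda$):

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM weighting over a (C, H, W) feature map.

    Computes the inverse-energy importance 1/e_t* per neuron
    (Equation (1)) and uses it, through a sigmoid, to reweight x.
    """
    C, H, W = x.shape
    n = H * W - 1
    mu = x.mean(axis=(1, 2), keepdims=True)           # per-channel mean
    d = (x - mu) ** 2
    v = d.sum(axis=(1, 2), keepdims=True) / n          # per-channel variance
    e_inv = d / (4 * (v + lam)) + 0.5                  # importance of each neuron
    return x * (1.0 / (1.0 + np.exp(-e_inv)))          # sigmoid-weighted features
```

Neurons far from their channel mean receive larger weights, matching the spatial suppression intuition above.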
In speech recognition, the input spectral features are often treated as images, making recognition of the speech signal analogous to image classification. Based on this idea, we combined the convolution module with SimAM to obtain the new SA-Conv module, whose structure is shown in Figure 6. The input vector first undergoes point-wise convolution and depth-wise separable convolution [15] to obtain a preliminary feature vector. A cosine similarity calculation is then performed on this feature vector to obtain the final context feature vector weighted with the SimAM weights. Compared with the convolution module of the original conformer, the SA-Conv module better represents the speech signal, captures correlations between adjacent speech frames, and improves the accuracy of the final predicted text.
3.2.3. SA-Conformer
Conformer achieved state-of-the-art performance in speech recognition [16]. As a variant of the transformer [17], the conformer uses convolutional modules to enforce local dependencies within sequences. The vanilla conformer block consists of a convolution module, a multi-head attention module, and two independent feed-forward network (FFN) layers.
In our SA-conformer, we replace the original convolution module with the proposed SA-Conv module and leave the remaining modules unchanged. As shown in Figure 7, our conformer block starts with an FFN, followed by the multi-head attention module, which captures interactions between different positions of the sequence. The SA-Conv module then captures deeper features, and a second FFN is deployed as the last module, followed by layer normalization.
4. Experiments and Results
We developed several end-to-end Tu–ASR models and compared their results: first those of the base models, then the improved results of the SA-conformer. The results show that our proposed model greatly improves recognition accuracy.
4.1. Experimental Setup
The experiments were conducted on the HZ-TuDs corpus: 4.3 h of audio were used for training, and the remaining 0.9 h and 1.0 h served as the test and validation sets, respectively.
The encoder has 12 layers, each with a convolutional kernel size of 15, a model dimension of 256, and 2 feed-forward layers of dimension 2048. Dropout [18] was set to 0.1 for all layers. The decoder is a standard transformer decoder with six layers; each layer has FFN layers of size 2048 and four attention heads, with dropout likewise set to 0.1. The loss is the CTC [19] loss, shown in Figure 8, whose gradient updates the model weights. The optimizer for gradient descent was Adam [20], with a learning rate of 0.001 and 75,000 warmup steps. During inference, we fed the audio features into the model to obtain the softmax output, which was then decoded by CTC beam search [21]. We trained the model on a computer equipped with an NVIDIA GTX 1080 Ti GPU (NVIDIA, Santa Clara, CA, USA). We evaluated performance using the most common ASR metric, the character error rate (CER), which measures the character-level error rate between the model-generated text and the reference text. The CER is defined in Equation (2):

$$\mathrm{CER} = \frac{S + D + I}{N} \quad (2)$$

where $I$ is the total number of insertions, $D$ the total number of deletions, $S$ the total number of substitutions, and $N$ the total number of characters in the reference. The ASR model performs better with lower CER values.
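Equation (2) can be computed directly from the Levenshtein edit distance between reference and hypothesis (a standard implementation sketch; the function name cer is ours):

```python
def cer(ref, hyp):
    """Character error rate via Levenshtein distance (Equation (2)).

    Counts the minimum number of substitutions, deletions and insertions
    turning ref into hyp, divided by the reference length N.
    """
    n, m = len(ref), len(hyp)
    prev = list(range(m + 1))            # distances for the empty ref prefix
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[m] / n

# e.g. cer("abcd", "abxd") == 0.25 (one substitution out of four characters)
```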
4.2. Results
We trained the baseline conformer from scratch on HZ-TuDs to determine its parameters, comparing the ReLU and Swish activation functions [22,23]. Given the small size of the dataset, we also compared different numbers of attention heads; the results show that adjusting the number of attention heads is beneficial.
Table 6 shows the character error rate (CER) for the different activation functions and numbers of attention heads. We therefore chose ReLU as the activation function of the baseline model and set the number of attention heads to four.
We then trained the two baseline models as well as the proposed new model from scratch on HZ-TuDs.
Table 7 presents the results obtained with the different models. The E2E GRU model shows a higher error rate: the conformer block adds an attention module and a convolution module, making it more sensitive to audio features and to relationships within the sequence, so we chose the conformer as the baseline to improve. After replacing the conformer's convolution module with our proposed SA-Conv module, the character error rate is reduced by 11 percentage points.
Table 8 illustrates examples of each trained model's predictions on the test data. Our proposed model performs well on both short sentences and infrequent long sentences.
5. Conclusions
We proposed a novel SA-conformer model and trained a state-of-the-art Tu–ASR model from scratch. In addition, starting from scratch, drawing on written materials and with the help of residents of Huzhu County, we established a more standardized Huzhu County dialect corpus.
Experimental results show that our SA-conformer outperforms the base conformer, reducing the CER from 23% to 12%. This confirms that fusing the convolution module with the SimAM module helps the model better understand and represent speech signals and improves generalization. Unlike ordinary attention mechanisms, which treat every neuron in a channel or spatial location equally and may thus fail to learn discriminative cues, SimAM seeks out the more active neurons, that is, the more representative features. SimAM did help us find the more important feature vectors within each speech frame, greatly improving the accuracy of our experiments.
The results we obtained are very competitive compared with those for similar low-resource languages [24,25,26]. However, Tu-language ASR still lags far behind Chinese ASR, which benefits from large datasets, greater computing power, and normative pronunciation guidelines.
Beyond the model's ASR results, this paper is also significant for the protection of similar low-resource languages. Many minority languages in China, like the Tu language, lack written scripts and are passed down only orally, leading to significant dialectal variation across regions. The approach adopted here, collecting data focused on one specific local area, is an effective solution: although narrower in scope, the data collected are more representative than dispersed samples, and the approach facilitates exploring pronunciation rules specific to that region. This lowers the research barrier for such rarely studied minority languages and provides concrete reference material for subsequent scholars. Furthermore, our Tu Automatic Speech Recognition (ASR) system integrates the regional language with modern technology, aiming to expand the dissemination channels of Tu culture and better promote the Tu minority language. This interdisciplinary approach helps overcome the limitations of traditional single-technology efforts in serving minority-language needs. We hope to further boost the Tu language's influence in the new-media environment by strengthening its digital representation, contributing to the internationalization of Tu culture.
Author Contributions
Conceptualization, S.K.; supervision, C.L.; methodology, S.K. and C.F.; investigation, P.Y. and C.L.; resources, S.K. and C.L.; validation, S.K., C.F. and P.Y.; visualization, S.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (Grant No.62166033) and the Basic Research Project of Qinghai Province, China (Grant No.2024-ZJ-788).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding authors. The data are not publicly available due to ethnic beliefs and copyrights.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Slam, W.; Li, Y.; Urouvas, N. Frontier Research on Low-Resource Speech Recognition Technology. Sensors 2023, 23, 9096. [Google Scholar] [CrossRef] [PubMed]
- Wu, J.; Ingram, C. Six decades of ethnic minority population change in China. Asian Popul. Stud. 2019, 15, 228–238. [Google Scholar] [CrossRef]
- Xiaoling, Y. A Comparative Study on the Vocabulary of Festivals in Minhe Huzhu Tu Language. Educ. Res. 2020, 2630–4686. [Google Scholar]
- Junast. Overview of the Tu language. Stud. Chin. Lang. 1964. [Google Scholar]
- Genxiong, J. A survey of the use and language attitudes of the Tu people. Chin. Mong. Stud. (Mong.) 2011, 39, 6. [Google Scholar]
- Haixia, W. On the significance of strengthening the protection of Tu language and cultural inheritance. Value Eng. 2013, 32, 2. [Google Scholar]
- Retrieval-based-Voice-Conversion-WebUI. Available online: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI (accessed on 5 April 2024).
- Sainath, T.N.; He, Y.; Li, B.; Narayanan, A.; Pang, R.; Bruguier, A.; Chang, S.Y.; Li, W.; Alvarez, R.; Chen, Z. A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Zhu, Z. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Comput. Sci. 2015, 48, 173–182. [Google Scholar]
- Hassan, M.K.A.; Rehmat, A.; Khan, M.U.G.; Yousaf, M.H. Improvement in Automatic Speech Recognition of South Asian Accent Using Transfer Learning of DeepSpeech2. Math. Probl. Eng. 2022, 2022, 1–12. [Google Scholar] [CrossRef]
- Zeineldeen, M.; Xu, J.; Lüscher, C.; Michel, W.; Gerstenberger, A.; Schlüter, R.; Ney, H. Conformer-Based Hybrid ASR System For Switchboard Dataset. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7437–7441. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 2, 3104–3112. [Google Scholar]
- Khandelwal, S.; Lecouteux, B.; Besacier, L. Comparing GRU and LSTM for Automatic Speech Recognition. Ph.D. Thesis, LIG, Taito City, Tokyo, 2016. [Google Scholar]
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR, 2021. pp. 11863–11874. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Freitag, M.; Al-Onaizan, Y. Beam search strategies for neural machine translation. arXiv 2017, arXiv:1702.01806. [Google Scholar]
- Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
- Nasr, S.; Duwairi, R.; Quwaider, M. End-to-end speech recognition for arabic dialects. Arab. J. Sci. Eng. 2023, 48, 10617–10633. [Google Scholar] [CrossRef]
- Dhakal, M.; Chhetri, A.; Gupta, A.K.; Lamichhane, P.; Pandey, S.; Shakya, S. Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet. In Proceedings of the 2022 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 20–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 515–521. [Google Scholar]
- Lonergan, L.; Qian, M.; Chiaráin, N.N.; Gobl, C.; Chasaide, A.N. Towards dialect-inclusive recognition in a low-resource language: Are balanced corpora the answer? arXiv 2023, arXiv:2307.07295. [Google Scholar]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).