MDPI - Publisher of Open Access Journals

17 pages, 1013 KB

Open Access Article

SRC-IT2: Speech Rate-Controllable Mongolian Emotional Speech Synthesis Based on Improved Tacotron2

by Qingdaoerji Ren, Qian Bo, Chao Zhou, Yatu Ji and Nier Wu

Electronics 2025, 14(19), 3835; https://doi.org/10.3390/electronics14193835 - 27 Sep 2025

Cited by 1 | Viewed by 889

To address the challenges of slow synthesis speed, unstable quality, limited emotional expressiveness, and the lack of controllable speaking rate in Mongolian emotional speech synthesis, this paper proposes a Speech Rate-Controllable Mongolian emotional speech synthesis model based on Improved Tacotron2 (SRC-IT2). First, an end-to-end Mongolian speech synthesis module is constructed based on an improved Tacotron2 framework, incorporating the unique linguistic characteristics of the Mongolian script. The front-end processing is optimized accordingly, and a G2P-Seq2Seq model is employed to achieve accurate grapheme-to-phoneme conversion for Mongolian characters. Next, on top of the end-to-end synthesis framework, a joint text-audio emotion analysis module is integrated to effectively learn and represent emotional style features specific to Mongolian speech. Finally, a style encoder and speaking rate control variable are embedded into the acoustic modeling process, further enhancing Tacotron2’s ability to dynamically adjust the speaking rate during emotional speech generation. Experimental results demonstrate that the proposed model produces more natural-sounding speech with improved emotional expressiveness and enables effective real-time control over speaking rate in Mongolian emotional speech synthesis.

(This article belongs to the Section Artificial Intelligence)

14 pages, 256 KB

Open Access Review

A Review of Neuroimaging Research of Chinese as a Second Language: Insights from the Assimilation–Accommodation Framework

by Jia Zhang, Xiaoyu Mou, Bingkun Li and Hehui Li

Behav. Sci. 2025, 15(9), 1243; https://doi.org/10.3390/bs15091243 - 12 Sep 2025

Viewed by 1639

Abstract

The assimilation–accommodation theory provides a crucial theoretical framework for understanding the neural mechanisms of second language (L2) processing. Chinese characters, as logographic scripts, contain diverse strokes and components with high visual complexity, and their grapheme–phoneme conversion differs fundamentally from alphabetic writing systems. Existing studies have identified unique neural patterns in Chinese language processing, yet a systematic synthesis of L2 Chinese processing remains limited. This review focuses on the brain mechanisms underlying Chinese language processing among L2 learners with diverse native language backgrounds. On the one hand, Chinese language processing relies on neural networks of the native language (assimilation); on the other hand, it recruits additional right-hemisphere regions to adapt to Chinese characters’ visuospatial complexity and grapheme–phoneme conversion strategies (accommodation). Accordingly, this review first synthesizes current brain imaging studies on L2 Chinese processing within this theoretical framework, noting that prevailing paradigms, limited to lexical and sentence-level processing, fail to capture the complexity, hierarchy, and dynamics of natural language. Next, this review examines the application and implications of naturalistic stimuli paradigms in neuroimaging research of L2 Chinese processing. Finally, future directions for this field are proposed. Collectively, these findings reveal neuroplasticity in processing complex ideographic scripts.

24 pages, 1038 KB

Open Access Article

Eye Movements of French Dyslexic Adults While Reading Texts: Evidence of Word Length, Lexical Frequency, Consistency and Grammatical Category

by Aikaterini Premeti, Frédéric Isel and Maria Pia Bucci

Brain Sci. 2025, 15(7), 693; https://doi.org/10.3390/brainsci15070693 - 27 Jun 2025

Cited by 1 | Viewed by 1398

Abstract

Background/Objectives: Dyslexia, a learning disability affecting reading, has been extensively studied using eye movements. This study aimed to examine in the same design the effects of different psycholinguistic variables, i.e., grammatical category, lexical frequency, word length and orthographic consistency, on eye movement patterns during reading in adults. Methods: We compared the eye movements of forty university students, twenty with and twenty without dyslexia, while they read aloud a meaningful and a meaningless text in order to examine whether semantic context could enhance their reading strategy. Results: Dyslexic participants made more reading errors and had longer reading times, particularly with the meaningless text, suggesting an increased reliance on the semantic context to enhance their reading strategy. They also made more progressive and regressive fixations while reading the two texts. Similar results were found when examining grammatical categories. These findings suggest a reduced visuo-attentional span and reliance on a serial decoding approach during reading, likely based on grapheme-to-phoneme conversion. Furthermore, in the whole text analysis, there was no difference in fixation duration between the groups. However, when examining word length, only the control group exhibited a distinction between longer and shorter words. No significant group differences emerged for word frequency. Importantly, multiple regression analyses revealed that orthographic consistency predicted fixation durations only in the control group, suggesting that dyslexic readers were less sensitive to phonological regularities, possibly due to underlying phonological deficits. Conclusions: These findings suggest the involvement of both phonological and visuo-attentional deficits in dyslexia. Combined remediation strategies may enhance dyslexic individuals’ performance in phonological and visuo-attentional tasks.

(This article belongs to the Section Developmental Neuroscience)

20 pages, 1420 KB

Open Access Article

A Survey of Grapheme-to-Phoneme Conversion Methods

by Shiyang Cheng, Pengcheng Zhu, Jueting Liu and Zehua Wang

Appl. Sci. 2024, 14(24), 11790; https://doi.org/10.3390/app142411790 - 17 Dec 2024

Cited by 6 | Viewed by 10603

Abstract

Grapheme-to-phoneme conversion (G2P) is the task of converting letters (grapheme sequences) into their pronunciations (phoneme sequences). It plays a crucial role in natural language processing, text-to-speech synthesis, and automatic speech recognition systems. This paper provides a systematic overview of G2P conversion from different perspectives. The conversion methods are presented first, with detailed discussion of methods based on deep learning. For each method, the key ideas, advantages, disadvantages, and representative models are summarized. The paper then discusses learning strategies and multilingual G2P conversion. Finally, it summarizes the commonly used monolingual and multilingual datasets, including Mandarin, Japanese, Arabic, etc. Two tables report the performance of various methods on the relevant datasets. After giving a general overview of G2P conversion, the paper concludes with the current issues and future directions of deep learning-based G2P conversion.

(This article belongs to the Collection Trends and Prospects in Multimedia)

20 pages, 4803 KB

Open Access Article

Near-Optimal Active Learning for Multilingual Grapheme-to-Phoneme Conversion

by Dezhi Cao, Yue Zhao and Licheng Wu

Appl. Sci. 2023, 13(16), 9408; https://doi.org/10.3390/app13169408 - 19 Aug 2023

Cited by 3 | Viewed by 2928

Abstract

The data-driven construction of pronunciation dictionaries relies on high-quality and extensive training data. However, the manual annotation of corpora for this purpose is both costly and time-consuming, especially for low-resource languages that lack sufficient data and resources. A multilingual pronunciation dictionary includes some common phonemes or phonetic units, which means that these phonemes or units have similarities in the pronunciation of different languages and can be used in the construction process of pronunciation dictionaries for low-resource languages. By using a multilingual pronunciation dictionary, knowledge can be shared among different languages, thus improving the quality and accuracy of pronunciation dictionaries for low-resource languages. In this paper, we propose using shared articulatory features among multiple languages to construct a universal phoneme set, which is then used to label words for multiple languages. To achieve this, we first developed a grapheme-to-phoneme (G2P) model based on an encoder–decoder deep neural network. We then adopted a near-optimal active learning method in the process of building the pronunciation dictionary to select informative samples from a large, unlabeled corpus and had them labeled by experts. Our experiments demonstrate that this method selected about 1/5 of the unlabeled data and achieved an even higher conversion accuracy than training on the large dataset. By selectively labeling samples with a high uncertainty in the model, while avoiding labeling samples that were accurately predicted by the current model, our method greatly enhances the efficiency of pronunciation dictionary construction.

(This article belongs to the Special Issue AI Technology and Application in Various Industries)
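The selection step described in this abstract (label the samples the current model is least sure about) can be illustrated with plain entropy-based uncertainty sampling. This is a minimal sketch, not the near-optimal method the paper proposes: the function names, the `predictions` structure, and the toy probabilities are all assumptions made for illustration.

```python
import math

def sequence_uncertainty(step_probs):
    """Mean per-step entropy (in nats) of a predicted phoneme sequence."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_probs
    ]
    return sum(entropies) / len(entropies)

def select_for_labeling(predictions, budget):
    """Return the `budget` words whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda word: sequence_uncertainty(predictions[word]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs: word -> per-step distributions over 3 phonemes.
predictions = {
    "alpha": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # confident
    "beta":  [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],  # uncertain
    "gamma": [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],  # in between
}

# Only the selected words would be sent to expert annotators.
print(select_for_labeling(predictions, budget=1))
```

Words the model already predicts confidently (here "alpha") are never sent for annotation, which is where the labeling savings come from.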

27 pages, 496 KB

Open Access Article

A Rule-Based Grapheme-to-Phoneme Conversion System

by Piotr Kłosowski

Appl. Sci. 2022, 12(5), 2758; https://doi.org/10.3390/app12052758 - 7 Mar 2022

Cited by 10 | Viewed by 6855

Abstract

This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. It should be noted that the fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to corresponding strings of phonemes, in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created out of an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.

(This article belongs to the Special Issue Automatic Speech Recognition)
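As a rough illustration of what a rule-based converter does, the sketch below applies a handful of grapheme rules with greedy longest-match lookup. The rule table is a toy assumption, not the Steffen-Batóg rule set, and `transcribe` is a hypothetical name unrelated to TransFon. The sketch is also context-free, whereas real Polish rules depend on neighbouring sounds (for example, word-final devoicing).

```python
# Toy rules for illustration only; NOT the rule set used by TransFon.
RULES = {
    "sz": "ʂ",
    "cz": "tʂ",
    "ch": "x",
    "w": "v",
    "ł": "w",
    "y": "ɨ",
}
MAX_LEN = max(len(grapheme) for grapheme in RULES)

def transcribe(word):
    """Convert a word to a list of phoneme symbols, longest rule first."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first so "sz" wins over "s" + "z".
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            phonemes.append(word[i])
            i += 1
    return phonemes

print(transcribe("szczyt"))  # ['ʂ', 'tʂ', 'ɨ', 't']
```

Trying digraphs before single letters is the key design choice: without it, "sz" would be read as two separate sounds.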

9 pages, 1277 KB

Open Access Article

Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

by Xiao Zhou, Zhenhua Ling, Yajun Hu and Lirong Dai

Appl. Sci. 2021, 11(21), 10475; https://doi.org/10.3390/app112110475 - 8 Nov 2021

Viewed by 2588

Abstract

An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to handle the case where pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, hidden states are introduced that absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well.

(This article belongs to the Special Issue Selected Papers from 16th National Conference on Man-Machine Speech Communication (NCMMSC2021))

14 pages, 1236 KB

Open Access Article

DUAL-tDCS Treatment over the Temporo-Parietal Cortex Enhances Writing Skills: First Evidence from Chronic Post-Stroke Aphasia

by Francesca Pisano, Carlo Caltagirone, Chiara Incoccia and Paola Marangolo

Life 2021, 11(4), 343; https://doi.org/10.3390/life11040343 - 14 Apr 2021

Cited by 8 | Viewed by 3997

Abstract

The learning of writing skills involves the re-engagement of previously established independent procedures. Indeed, the writing deficit an adult may acquire after left hemispheric brain injury is caused by an impairment to the lexical route, which processes words as a whole, to the sublexical procedure based on phoneme-to-grapheme conversion rules, or to both procedures. To date, several approaches have been proposed for writing disorders; among these, interventions aimed at restoring the sublexical procedure were successful in cases of severe agraphia. In a randomized double-blind crossover design, fourteen chronic Italian post-stroke aphasics underwent dual transcranial direct current stimulation (tDCS) (20 min, 2 mA) with anodal and cathodal current simultaneously placed over the left and right temporo-parietal cortex, respectively. Two different conditions were considered: (1) real, and (2) sham, while performing a writing task. Each experimental condition was performed for ten workdays over two weeks. After real stimulation, a greater amelioration in writing with respect to the sham was found. Relevantly, these effects generalized to different language tasks not directly treated. This evidence suggests, for the first time, that dual tDCS associated with training is efficacious for severe agraphia. Our results confirm the critical role of the temporo-parietal cortex in writing skills.

(This article belongs to the Special Issue Brain Function, Dysfunction and Post-Damage Reorganization, Two Decades of Research)

22 pages, 4592 KB

Open Access Article

Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings

by Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha and Moussa Diallo

Information 2021, 12(2), 62; https://doi.org/10.3390/info12020062 - 3 Feb 2021

Cited by 19 | Viewed by 7132

Abstract

Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word and character levels of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition (AASR) system using its phoneme-based subword units. This algorithm helps to insert the epithetic vowel እ[ɨ], which is not included in our Grapheme-to-Phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subwords, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to result in more accurate speech recognition systems than the character-based, phoneme-based, and character-based subword counterparts. Further improvement was also obtained in the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with the word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful to improve machine and speech translation tasks.

(This article belongs to the Special Issue Natural Language Processing for Social Media)
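To show how byte-pair encoding can build subword units from phonemes rather than characters, here is a minimal sketch of the standard BPE merge loop applied to phoneme token lists. The function names and the toy sequences are assumptions for illustration. They are not Amharic data and not the paper's implementation.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges; return (merges, segmented corpus)."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent unit pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges, corpus

# Hypothetical phoneme sequences (one list of phoneme symbols per word).
words = [["s", "a", "l"], ["s", "a", "m"], ["s", "a", "l", "a"]]
merges, segmented = learn_bpe(words, num_merges=2)
print(merges)     # [('s', 'a'), ('sa', 'l')]
print(segmented)  # [['sal'], ['sa', 'm'], ['sal', 'a']]
```

Running the same loop on character sequences instead of phoneme sequences would give the character-based subwords that the abstract compares against.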

17 pages, 19053 KB

Open Access Article

Grapheme-to-Phoneme Conversion with Convolutional Neural Networks

by Sevinj Yolchuyeva, Géza Németh and Bálint Gyires-Tóth

Appl. Sci. 2019, 9(6), 1143; https://doi.org/10.3390/app9061143 - 18 Mar 2019

Cited by 35 | Viewed by 20483

Abstract

Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form. It plays an essential role in natural language processing, text-to-speech synthesis and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNN) for G2P conversion. We propose a novel CNN-based sequence-to-sequence (seq2seq) architecture for G2P conversion. Our approach includes an end-to-end CNN G2P conversion with residual connections and, furthermore, a model that utilizes a convolutional neural network (with and without residual connections) as encoder and Bi-LSTM as a decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times, phoneme and word error rates were evaluated on the public CMUDict dataset for US English, and the best performing convolutional neural network-based architecture was also evaluated on the NetTalk dataset. Our method approaches the accuracy of previous state-of-the-art results in terms of phoneme error rate.

(This article belongs to the Special Issue Recent Advances on Signal Processing and Deep Learning for Public Security Applications)
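The two metrics named in this abstract can be computed as follows under the definitions commonly used in G2P evaluation: phoneme error rate (PER) as total edit distance divided by total reference phonemes, and word error rate (WER) as the share of words with at least one phoneme error. This is a sketch under those assumed definitions. The paper's exact scoring script is not shown in the source, and the two example words are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def per_and_wer(references, hypotheses):
    """Return (phoneme error rate, word error rate) over paired lists."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phonemes = sum(len(r) for r in references)
    wrong_words = sum(r != h for r, h in zip(references, hypotheses))
    return errors / total_phonemes, wrong_words / len(references)

# Illustrative ARPAbet-style pronunciations (stress marks omitted).
refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AE", "T"], ["D", "AA", "G"]]
per, wer = per_and_wer(refs, hyps)
print(f"PER={per:.3f} WER={wer:.3f}")  # PER=0.167 WER=0.500
```

A single wrong phoneme costs one sixth in PER but half the words in WER, which is why WER is always at least as harsh as PER on the same predictions.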

Search Results (10)
