Search Results (240)

Search Parameters:
Keywords = phoneme

33 pages, 1363 KB  
Article
A Cross-Language Investigation of Stimulus- and Person-Level Characteristics That Determine Phonemic Processing in Monolingual French- and German-Speaking Preschoolers
by Jessica Carolyn Weiner-Bühler, Katrin Skoruppa, Leila Teresa Schächinger Tenés, Robin Klaus Segerer and Alexander Grob
Languages 2025, 10(10), 261; https://doi.org/10.3390/languages10100261 - 12 Oct 2025
Abstract
Phonemic processing is largely influenced by how stimulus-specific characteristics of a language are computed, but person-level variables represent important moderators as well. The current study investigates how such characteristics, in parallel, affect receptive-level phonemic processing across the preschool years, and whether these effects are comparable across different languages. Using a child-friendly ‘odd-man-out’ discrimination task, we examined 239 monolingual German- and French-speaking preschoolers aged three to five. Results revealed that phonotactic probability-based syllable frequency, nonword length, and mismatching nonword positioning effects explained independent variance components of phonemic processing. Age significantly affected how memory-related, but not linguistically relevant, stimulus characteristics were utilized for phonemic processing. Additionally, cross-language differences in rhythmic structure between German and French influenced which nonword segments received more attentional focus. These findings provide novel insights into critical determinants of phonemic processing in preschoolers and highlight the need for further research exploring these effects over time and across varying language backgrounds.

29 pages, 1708 KB  
Article
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
by Aidana Karibayeva, Vladislav Karyukin, Balzhan Abduali and Dina Amirova
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879 - 10 Oct 2025
Abstract
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech.
(This article belongs to the Section Artificial Intelligence)
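
For readers who want to reproduce this style of lexical-level ASR scoring, here is a minimal sketch using the open-source jiwer and sacrebleu packages. The paper's exact evaluation tooling is not stated in the abstract, and the Kazakh strings below are placeholders.

```python
# pip install jiwer sacrebleu
import jiwer
from sacrebleu.metrics import CHRF

reference = "бұл тест сөйлемі"   # hypothetical Kazakh reference transcript
hypothesis = "бұл тест сөйлем"   # hypothetical ASR output

wer = jiwer.wer(reference, hypothesis)    # word error rate
cer = jiwer.cer(reference, hypothesis)    # character error rate
chrf = CHRF().corpus_score([hypothesis], [[reference]]).score  # character F-score

print(f"WER={wer:.3f}  CER={cer:.3f}  chrF={chrf:.2f}")
```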

44 pages, 882 KB  
Article
A Comparative Perspective on Language Shift and Language Change: Norwegian and German Heritage Varieties in North America
by Alexander K. Lykke and Maike H. Rocker
Languages 2025, 10(10), 256; https://doi.org/10.3390/languages10100256 - 30 Sep 2025
Abstract
This study evaluates the relationship between language shift and linguistic change in multigenerational immigrant communities, focusing on North American Norwegian (NAmNo) and German heritage varieties. The research synthesizes current findings on how language shift impacts linguistic structures in moribund heritage varieties. Methods include a qualitative review of diachronic studies, comparing data from different periods to assess changes in tense morphology, language mixing, compositional definiteness, possessive placement, verb placement, argument placement, and phoneme variation. Results indicate that the last generation of heritage speakers demonstrates increased linguistic innovation and variation compared to earlier generations. Key findings show that language shift leads to different input quality and quantity, affecting grammatical stability. The study concludes that sociocultural changes, such as verticalization and domain-specific language use, significantly influence heritage language maintenance and loss. These insights contribute to understanding the dynamics of language shift and its role in heritage language change, offering valuable comparative perspectives across different immigrant communities.

17 pages, 1013 KB  
Article
SRC-IT2: Speech Rate-Controllable Mongolian Emotional Speech Synthesis Based on Improved Tacotron2
by Qingdaoerji Ren, Qian Bo, Chao Zhou, Yatu Ji and Nier Wu
Electronics 2025, 14(19), 3835; https://doi.org/10.3390/electronics14193835 - 27 Sep 2025
Abstract
To address the challenges of slow synthesis speed, unstable quality, limited emotional expressiveness, and the lack of controllable speaking rate in Mongolian emotional speech synthesis, this paper proposes a speech Rate-Controllable Mongolian emotional speech synthesis model based on improved Tacotron2 (SRC-IT2). First, an end-to-end Mongolian speech synthesis module is constructed based on an improved Tacotron2 framework, incorporating the unique linguistic characteristics of the Mongolian script. The front-end processing is optimized accordingly, and a G2P-Seq2Seq model is employed to achieve accurate grapheme-to-phoneme conversion for Mongolian characters. Next, on top of the end-to-end synthesis framework, a joint text-audio emotion analysis module is integrated to effectively learn and represent emotional style features specific to Mongolian speech. Finally, a style encoder and speaking rate control variable are embedded into the acoustic modeling process, further enhancing Tacotron2’s ability to dynamically adjust the speaking rate during emotional speech generation. Experimental results demonstrate that the proposed model produces more natural-sounding speech with improved emotional expressiveness and enables effective real-time control over speaking rate in Mongolian emotional speech synthesis.
(This article belongs to the Section Artificial Intelligence)
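
The abstract describes embedding a speaking-rate control variable into the acoustic model. Below is a minimal PyTorch sketch of scalar-rate conditioning; the module names, layers, and dimensions are invented for illustration and are not the paper's architecture.

```python
import torch
import torch.nn as nn

class RateConditionedEncoder(nn.Module):
    """Toy encoder that concatenates a speaking-rate scalar to each frame."""
    def __init__(self, text_dim=256, rate_dim=16, hidden=256):
        super().__init__()
        self.rate_proj = nn.Linear(1, rate_dim)               # embed the scalar rate
        self.rnn = nn.LSTM(text_dim + rate_dim, hidden, batch_first=True)

    def forward(self, text_hidden, rate):
        # text_hidden: (B, T, text_dim); rate: (B, 1), e.g. 1.0 = normal speed
        r = self.rate_proj(rate).unsqueeze(1)                 # (B, 1, rate_dim)
        r = r.expand(-1, text_hidden.size(1), -1)             # broadcast over time
        out, _ = self.rnn(torch.cat([text_hidden, r], dim=-1))
        return out

enc = RateConditionedEncoder()
h = enc(torch.randn(2, 50, 256), torch.tensor([[1.0], [1.5]]))  # two rates
```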

31 pages, 3671 KB  
Article
Research on Wu Dialect Recognition and Regional Variations Based on Deep Learning
by Xinyi Yue, Lizhi Miao and Jiahao Ding
Appl. Sci. 2025, 15(18), 10227; https://doi.org/10.3390/app151810227 - 19 Sep 2025
Abstract
Wu dialects carry a rich regional culture, but their significant internal variation makes automated recognition challenging. This study focuses on speech recognition and semantic feedback for Wu dialects, proposing a deep learning system with regional adaptability and semantic feedback capabilities. First, a corpus covering multiple Wu dialect regions (WXDPC) is constructed, and a two-level phoneme mapping and regional difference modeling mechanism is introduced. By incorporating geographical region labels and transfer learning, the model’s performance in non-central regions is improved. Experimental results show that the model’s character error rate (CER) decreases significantly as the training corpus grows. After the introduction of regional labels, the CER in non-central Wu dialect regions decreased by 4.5%, demonstrating the model’s effectiveness in complex dialect environments. The system provides technical support for the preservation and application of Wu dialects and offers transferable experience for building recognition systems for other dialects.
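
The two-level phoneme mapping is not specified in detail in the abstract; the following toy sketch shows what such a mapping could look like, with an invented phone inventory (the real WXDPC inventory is an assumption here).

```python
# Hypothetical two-level phoneme mapping: surface dialect phones are first
# mapped to a region-level inventory, then to a shared Wu-wide modeling unit.
REGION_LEVEL = {  # level 1: surface phone -> regional phoneme (invented)
    "ɦ": "h_voiced",
    "ʑ": "z_palatal",
}
SHARED_LEVEL = {  # level 2: regional phoneme -> shared unit (invented)
    "h_voiced": "h",
    "z_palatal": "z",
}

def map_phone(surface: str) -> str:
    regional = REGION_LEVEL.get(surface, surface)  # pass through unknown phones
    return SHARED_LEVEL.get(regional, regional)

print([map_phone(p) for p in ["ɦ", "ʑ", "a"]])  # ['h', 'z', 'a']
```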

12 pages, 685 KB  
Article
Changes in Bilabial Contact Pressure as a Function of Vocal Loudness in Individuals with Parkinson’s Disease
by Jeff Searl
Appl. Sci. 2025, 15(18), 10165; https://doi.org/10.3390/app151810165 - 18 Sep 2025
Abstract
This study evaluated the impact of vocal loudness on bilabial contact pressure (BCP) during the production of bilabial English consonants in adults with Parkinson’s disease (PD). Twelve adults with PD produced sentences in which the phonemes /b, p, m/ initiated a linguistically meaningful word, while BCP was sensed with a miniature pressure transducer positioned at the midline between the upper and lower lips. Stimuli were produced at two loudness levels: Habitual, and twice as loud as habitual (Loud). A linear mixed model (LMM) indicated a statistically significant main effect of Condition (F (1, 714) = 16.210, p < 0.001), with Loud having greater BCP than Habitual (mean difference of 0.593 kPa). The main effect of Phoneme was also significant (F (1, 714) = 31.905, p < 0.001), with post hoc tests revealing that BCP was significantly higher for /p/ compared to /m/ (p = 0.007) and for /b/ compared to /m/ (p = 0.002). An additional LMM of the magnitude of the percent change in BCP in the Loud condition relative to the Habitual condition showed a significant main effect of Phoneme (F (2, 22.3) = 5.871, p = 0.006). The percent change in BCP was greatest for /p/ (47.7%), followed by /b/ (35.7%) and /m/ (27.4%), with statistically significant differences for both /p/ and /b/ compared to /m/ in post hoc tests. The results indicated that changes in vocal loudness cause changes in BCP in individuals with PD. A louder voice was associated with higher BCP for all three phonemes, although the increase was greatest for the oral stops /p/ and /b/ compared to the nasal /m/. These results provide initial insights into the mechanism by which therapeutic interventions focused on increasing loudness in people with PD alter oral articulatory behaviors. Future work detailing the aerodynamics (e.g., oral air pressure build-up) and articulatory acoustics (e.g., burst intensity) of louder speech is needed to explain why loudness-focused treatments may improve speech intelligibility in people with PD.
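
A hedged sketch of fitting this kind of linear mixed model with statsmodels follows; the paper's actual software, variable names, and data layout are not given in the abstract, so the synthetic stand-in data and column names below are assumptions.

```python
# pip install statsmodels pandas numpy
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the trial data (structure and effect sizes invented).
rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    "speaker": rng.integers(0, 12, n),                  # 12 hypothetical participants
    "condition": rng.choice(["Habitual", "Loud"], n),
    "phoneme": rng.choice(["b", "p", "m"], n),
})
df["bcp_kpa"] = 1.5 + 0.6 * (df["condition"] == "Loud") + rng.normal(0, 0.3, n)

# Random intercept per speaker; fixed effects of condition and phoneme.
model = smf.mixedlm("bcp_kpa ~ condition + phoneme", df, groups=df["speaker"])
print(model.fit().summary())
```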

14 pages, 256 KB  
Review
A Review of Neuroimaging Research of Chinese as a Second Language: Insights from the Assimilation–Accommodation Framework
by Jia Zhang, Xiaoyu Mou, Bingkun Li and Hehui Li
Behav. Sci. 2025, 15(9), 1243; https://doi.org/10.3390/bs15091243 - 12 Sep 2025
Abstract
The assimilation–accommodation theory provides a crucial theoretical framework for understanding the neural mechanisms of second language (L2) processing. Chinese characters, as logographic scripts, contain diverse strokes and components with high visual complexity, and their grapheme–phoneme conversion differs fundamentally from alphabetic writing systems. Existing studies have identified unique neural patterns in Chinese language processing, yet a systematic synthesis of L2 Chinese processing remains limited. This review focuses on the brain mechanisms underlying Chinese language processing among L2 learners with diverse native language backgrounds. On the one hand, Chinese language processing relies on neural networks of the native language (assimilation); on the other hand, it recruits additional right-hemisphere regions to adapt to Chinese characters’ visuospatial complexity and grapheme–phoneme conversion strategies (accommodation). Accordingly, this review first synthesizes current brain imaging studies on L2 Chinese processing within this theoretical framework, noting that prevailing paradigms—limited to lexical and sentence-level processing—fail to capture the complexity, hierarchy, and dynamics of natural language. Next, this review examines the application and implications of naturalistic stimuli paradigms in neuroimaging research of L2 Chinese processing. Finally, future directions for this field are proposed. Collectively, these findings reveal neuroplasticity in processing complex ideographic scripts.

17 pages, 1244 KB  
Article
Evidence for Language Policy in Government Pre-Primary Schools in Nigeria: Cross-Language Transfer and Interdependence
by Pauline Dixon, Steve Humble, Louise Gittins, Francesca Seery and Chris Counihan
Educ. Sci. 2025, 15(9), 1197; https://doi.org/10.3390/educsci15091197 - 11 Sep 2025
Abstract
This study explores the relationships between and within Hausa and English letter sound knowledge and word decoding skills among children in early years settings in northern Nigeria. There is a lack of correlational studies, as well as causal evidence, in the African context to indicate any transfer of language skills from L1 to L2 and vice versa. A total of 851 children studying in 158 government-provided pre-primary schools took tests of letter sound (phoneme) knowledge and word (reading) decoding skills. Through bivariate correlations and a just-identified feedback path model, the results support Cummins’ interdependence hypothesis. Hausa and English word scores are bidirectionally associated, and the data reveal very strong, significant positive correlations between Hausa and English letter sound scores and between Hausa and English word scores. With Nigeria’s language policy set to change so that the language of the immediate community may become a medium of instruction, these results, supporting bidirectionality and linguistic interdependence, provide evidence for the teaching of both L1 and L2 in pre-primary settings in northern Nigeria.
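
A minimal pandas sketch of the bivariate-correlation step on synthetic stand-in scores is shown below; the column names are hypothetical, and the just-identified feedback path model itself would additionally require an SEM package.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the four test scores (851 children, names invented).
rng = np.random.default_rng(1)
base = rng.normal(0, 1, 851)                 # shared ability factor, for realism
df = pd.DataFrame({
    "hausa_letter": base + rng.normal(0, 0.5, 851),
    "eng_letter":   base + rng.normal(0, 0.5, 851),
    "hausa_word":   base + rng.normal(0, 0.6, 851),
    "eng_word":     base + rng.normal(0, 0.6, 851),
})

# Bivariate correlations among the four measures.
print(df.corr().round(2))
```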

25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and a Conformer, designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation on a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusable phonemes, such as diphthongs (/ai/, /au/) and labiodental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings.
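
To make the dual-branch idea concrete, here is a toy PyTorch sketch with simple stand-ins for the spatial and temporal streams; the layers, dimensions, and fusion scheme are invented and much simpler than the published Video Swin + Conformer model.

```python
import torch
import torch.nn as nn

class DualStreamToy(nn.Module):
    """Toy two-stage model: per-frame spatial embedding, then temporal
    modeling, then per-frame character logits (e.g., for a CTC loss)."""
    def __init__(self, feat=256, n_chars=50):
        super().__init__()
        self.spatial = nn.Linear(512, feat)                    # stand-in for Video Swin
        self.temporal = nn.GRU(feat, feat, batch_first=True)   # stand-in for Conformer
        self.head = nn.Linear(feat, n_chars)

    def forward(self, frames):          # frames: (B, T, 512) precomputed features
        x = self.spatial(frames)        # spatial stream: per-frame embedding
        x, _ = self.temporal(x)         # temporal stream: context across frames
        return self.head(x)             # (B, T, n_chars) logits

logits = DualStreamToy()(torch.randn(2, 75, 512))
```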

18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder, which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space, the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces the phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions.
(This article belongs to the Section Computer)
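
A minimal sketch of boundary-aligned phoneme dropout over forced-alignment segments follows; the curriculum scheduling of the drop probability and the corresponding audio edits are omitted, and the alignment tuple format is an assumption.

```python
import random

def phoneme_dropout(segments, drop_p):
    """Drop whole phoneme segments from a forced alignment.

    segments: list of (phone, start_frame, end_frame) tuples, e.g. from the
    Montreal Forced Aligner. Returns the kept segments; in a full pipeline
    the dropped spans would also be cut from the audio.
    """
    return [seg for seg in segments if random.random() > drop_p]

ali = [("s", 0, 8), ("i", 8, 20), ("k", 20, 27)]   # toy alignment
print(phoneme_dropout(ali, drop_p=0.2))
```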

14 pages, 7196 KB  
Article
Touch to Speak: Real-Time Tactile Pronunciation Feedback for Individuals with Speech and Hearing Impairments
by Anat Sharon, Roi Yozevitch and Eldad Holdengreber
Technologies 2025, 13(8), 345; https://doi.org/10.3390/technologies13080345 - 7 Aug 2025
Abstract
This study presents a wearable haptic feedback system designed to support speech training for individuals with speech and hearing impairments. The system provides real-time tactile cues based on detected phonemes, helping users correct their pronunciation independently. Unlike prior approaches focused on passive reception or therapist-led instruction, our method enables active, phoneme-level feedback using a multimodal interface combining audio input, visual reference, and spatially mapped vibrotactile output. We validated the system through three user studies measuring pronunciation accuracy, phoneme discrimination, and learning over time. The results show a significant improvement in word articulation accuracy and user engagement. These findings highlight the potential of real-time haptic pronunciation tools as accessible, scalable aids for speech rehabilitation and second-language learning.
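
A toy sketch of the kind of phoneme-to-actuator mapping such a system might use is given below; the actuator layout and pulse duration are invented for illustration and are not taken from the paper.

```python
# Hypothetical phoneme-to-actuator map: each detected phoneme triggers a
# motor position on the wearable (layout and IDs invented).
ACTUATOR_MAP = {"p": 0, "b": 1, "s": 2, "sh": 3}

def feedback_events(detected_phonemes, pulse_ms=120):
    """Turn a detected phoneme stream into (actuator_id, pulse_ms) events,
    skipping phonemes with no mapped actuator."""
    return [(ACTUATOR_MAP[p], pulse_ms) for p in detected_phonemes if p in ACTUATOR_MAP]

print(feedback_events(["p", "a", "s"]))  # [(0, 120), (2, 120)]
```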

20 pages, 821 KB  
Article
The Role of Phoneme Discrimination in the Variability of Speech and Language Outcomes Among Children with Hearing Loss
by Kerry A. Walker, Jinal K. Shah, Lauren Alexander, Stacy Stiell, Christine Yoshinaga-Itano and Kristin M. Uhler
Behav. Sci. 2025, 15(8), 1072; https://doi.org/10.3390/bs15081072 - 6 Aug 2025
Abstract
This research compares speech discrimination abilities between 17 children who are hard-of-hearing (CHH) and 13 children with normal hearing (CNH), aged 9 to 36 months, using either a conditioned head turn (CHT) or conditioned play paradigm, for two phoneme pairs, /ba-da/ and /sa-ʃa/. Because CHH were tested in both aided and unaided conditions, CNH were also tested twice on each phoneme contrast to control for learning effects. When speech discrimination abilities were compared between CHH wearing hearing aids (HAs) and CNH, no statistically significant differences were observed for stop consonant discrimination, but a significant difference was observed for fricative discrimination. Among CHH, significant benefits were observed for /ba-da/ discrimination while wearing HAs compared to the no-HA condition. All CHH were identified early, amplified early, and enrolled in parent-centered early intervention services. Under these conditions, CHH discriminated speech comparably to CNH. Additionally, repeated testing within one month did not change speech discrimination scores, indicating good test–retest reliability. Finally, this research explored infant/toddler listening fatigue in the behavioral speech discrimination task: the CHT paradigm included returning to a contrast (/a-i/) previously shown to be easier for both CHH and CNH to discriminate, to examine whether failure to discriminate /ba-da/ or /sa-ʃa/ was due to listening fatigue or off-task behavior.
(This article belongs to the Special Issue Language and Cognitive Development in Deaf Children)

20 pages, 367 KB  
Article
Power Dynamics and Discourse Technologies in Jordanian Colloquial Arabic Allophonic Consonant Variations
by Bassel Alzboun, Raed Al Ramahi and Nisreen Abu Hanak
Languages 2025, 10(8), 190; https://doi.org/10.3390/languages10080190 - 5 Aug 2025
Abstract
Most academic papers on Jordanian colloquial Arabic allophonic consonant variants have primarily examined their influence on the social status of speakers and their role in shaping linguistic prestige. However, there is a significant lack of research exploring how Jordanian speakers deliberately use consonantal variants to manipulate interlocutors and establish power. This study investigates how speakers of Jordanian colloquial Arabic deploy a range of allophonic consonantal variants to construct a discourse of power. The targeted phonemes were /q/, /θ/, /ð/, and /k/. Data were gathered through two focus groups and examined within the framework of Fairclough’s technologized discourse and thematic approaches. Each group comprised twenty participants, 10 women and 10 men, aged 18 to 45 years, and each focus group session lasted 50 min. Analysis of the data indicates that the use of the [q], [θ], [ð], and [k] allophones of Standard Arabic is restricted to particular social circumstances, such as official and scientific settings; this usage is a common trait among those with formal education and privileged social standing. The findings also reveal that participants strategically utilize the allophonic variants [g], [ʔ], [k], [t̪], [d̪], and [tʃ] to exert influence over interlocutors by demonstrating authority related to social identity, gender, and emotional state. This study aims to advance discussions on allophonic consonant variants in Jordanian colloquial Arabic by providing insights into their manipulative functions.

37 pages, 618 KB  
Systematic Review
Interaction, Artificial Intelligence, and Motivation in Children’s Speech Learning and Rehabilitation Through Digital Games: A Systematic Literature Review
by Chra Abdoulqadir and Fernando Loizides
Information 2025, 16(7), 599; https://doi.org/10.3390/info16070599 - 12 Jul 2025
Abstract
The integration of digital serious games into speech learning (rehabilitation) has demonstrated significant potential in enhancing accessibility and inclusivity for children with speech disabilities. This review of the state of the art examines the role of serious games, Artificial Intelligence (AI), and Natural Language Processing (NLP) in speech rehabilitation, with a particular focus on interaction modalities, engagement autonomy, and motivation. We have reviewed 45 selected studies. Our key findings show how intelligent tutoring systems, adaptive voice-based interfaces, and gamified speech interventions can empower children to engage in self-directed speech learning, reducing dependence on therapists and caregivers. The diversity of interaction modalities, including speech recognition, phoneme-based exercises, and multimodal feedback, demonstrates how AI and Assistive Technology (AT) can personalise learning experiences to accommodate diverse needs. Furthermore, the incorporation of gamification strategies, such as reward systems and adaptive difficulty levels, has been shown to enhance children’s motivation and long-term participation in speech rehabilitation. The gaps identified show that despite advancements, challenges remain in achieving universal accessibility, particularly regarding speech recognition accuracy, multilingual support, and accessibility for users with multiple disabilities. This review advocates for interdisciplinary collaboration across educational technology, special education, cognitive science, and human–computer interaction (HCI). Our work contributes to the ongoing discourse on lifelong inclusive education, reinforcing the potential of AI-driven serious games as transformative tools for bridging learning gaps and promoting speech rehabilitation beyond clinical environments.

15 pages, 1359 KB  
Article
Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(14), 4288; https://doi.org/10.3390/s25144288 - 9 Jul 2025
Abstract
Cantonese Automatic Speech Recognition (ASR) is hindered by tonal complexity, acoustic diversity, and a lack of labelled data. This study proposes a phoneme-aware hierarchical augmentation framework that enhances performance without additional annotation. A Phoneme Substitution Matrix (PSM), built from Montreal Forced Aligner alignments and Tacotron-2 synthesis, injects adversarial phoneme variants into both transcripts and their aligned audio segments, enlarging pronunciation diversity. Concurrently, a semantic-aware SpecAugment scheme exploits wav2vec 2.0 attention heat maps and keyword boundaries to adaptively mask informative time–frequency regions; a reinforcement-learning controller tunes the masking schedule online, forcing the model to rely on a wider context. On the Common Voice Cantonese 50 h subset, the combined strategy reduces the character error rate (CER) from 26.17% to 16.88% with wav2vec 2.0 and from 38.83% to 23.55% with Zipformer. At 100 h, the CER further drops to 4.27% and 2.32%, yielding relative gains of 32–44%. Ablation studies confirm that phoneme-level and masking components provide complementary benefits. The framework offers a practical, model-independent path toward accurate ASR for Cantonese and other low-resource tonal languages. This paper presents an intelligent sensing-oriented modeling framework for speech signals, which is suitable for deployment on edge or embedded systems to process input from audio sensors (e.g., microphones) and shows promising potential for voice-interactive terminal applications.
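
A minimal sketch of PSM-style transcript perturbation is shown below; the substitution entries and probabilities are invented, and the paper's matched audio re-synthesis step is omitted.

```python
import random

# Hypothetical phoneme substitution matrix: maps a phoneme to plausible
# confusable variants with sampling weights (entries and weights invented).
PSM = {
    "s": [("sh", 0.6), ("z", 0.4)],
    "n": [("l", 1.0)],
}

def substitute(phonemes, sub_p=0.15):
    """Randomly replace phonemes with confusable variants from the PSM.
    This sketch perturbs only the transcript side of a training pair."""
    out = []
    for p in phonemes:
        if p in PSM and random.random() < sub_p:
            variants, weights = zip(*PSM[p])
            out.append(random.choices(variants, weights=weights)[0])
        else:
            out.append(p)
    return out

print(substitute(["n", "i", "s"], sub_p=0.5))
```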
