Search Results (19)

Search Parameters:
Keywords = pronunciation problem

14 pages, 633 KiB  
Article
Orofacial Muscle Strength and Associated Potential Factors in Healthy Korean Community-Dwelling Older Adults: A Pilot Cross-Sectional Study
by Da-Som Lee, Ji-Youn Kim and Jun-Seon Choi
Appl. Sci. 2024, 14(22), 10560; https://doi.org/10.3390/app142210560 - 15 Nov 2024
Viewed by 809
Abstract
Most previous studies on orofacial muscle strength have focused on older adults with conditions associated with sensorimotor deficits, such as stroke. However, the modifiable oral health factors that directly impact orofacial muscle strength and swallowing ability in healthy older adults remain unexplored. This pilot study explored the potential factors associated with orofacial muscle strength, particularly oral health conditions, in 70 healthy adults aged ≥65 years living independently without any diseases that cause dysphagia or sensorimotor deficits. The Iowa Oral Performance Instrument (IOPI) was used to assess orofacial muscle strength (tongue elevation and cheek and lip compression). Statistical analyses were conducted using independent t-tests, one-way ANOVA, and multivariate linear regression. In the final regression models after adjustment, older age and fewer remaining teeth were significantly associated with reduced tongue and cheek strength (p < 0.05). Socio-demographic factors, such as age, and oral health conditions, such as discomfort in pronunciation or mastication due to oral problems, poor self-rated oral health, and reduced salivary flow, were associated with tongue, cheek, and lip muscle strength (p < 0.05). Early, active oral health interventions can help prevent a decline in orofacial muscle strength in healthy older adults.
(This article belongs to the Special Issue Oral Diseases and Clinical Dentistry)
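The adjusted associations reported here come from standard multivariate linear regression. A minimal sketch of that analysis step in Python (statsmodels), assuming a hypothetical per-participant table whose file and column names are illustrative, not the authors' actual variables or data:

```python
# Sketch of the adjusted multivariate linear regression the study describes.
# Column names are illustrative placeholders, not the authors' dataset.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("iopi_pilot.csv")  # hypothetical file, one row per participant

# Tongue strength modeled on age, remaining teeth, and self-rated oral health,
# mirroring the "final regression models after adjustment" in the abstract.
model = smf.ols(
    "tongue_strength ~ age + remaining_teeth + self_rated_oral_health",
    data=df,
).fit()
print(model.summary())  # coefficients with p-values; p < 0.05 flags associations
```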

15 pages, 483 KiB  
Article
Oral State and Salivary Cortisol in Multiple Sclerosis Patients
by Aleksandra Kapel-Reguła, Justyna Chojdak-Łukasiewicz, Anna Rybińska, Irena Duś-Ilnicka and Małgorzata Radwan-Oczko
Biomedicines 2024, 12(10), 2277; https://doi.org/10.3390/biomedicines12102277 - 8 Oct 2024
Viewed by 1132
Abstract
Background: MS patients experience gradual and progressive functional limitation, bulbar symptoms, cognitive dysfunction, and psychiatric disorders that can impinge on oral status. This study aimed to investigate the oral state, oral hygiene habits, and salivary cortisol levels in patients with relapsing-remitting multiple sclerosis (RRMS) compared to healthy controls. It also evaluated systemic parameters in the study group: disease duration, type of disease-modifying therapy (DMT), disability score, professional activity, and smoking. Methods: This study included 101 patients (71 women and 30 men, aged 16–71 years) and 51 healthy volunteers (36 women and 15 men, aged 28–82 years). The oral examination assessed the number of teeth, the type and number of dental fillings and prosthetic restorations, oral hygiene state, and salivary cortisol. Results: MS patients were more often professionally active and reported significantly more swallowing problems, pronunciation issues, dry mouth, and taste disturbances than the control group. They brushed their teeth twice daily significantly less often. The Approximal Plaque Index (API) was higher, while the Sulcus Bleeding Index (SBI) was lower, in MS patients. Disease duration positively correlated with age and the number of missing teeth. The Expanded Disability Status Scale positively correlated with age, disease duration, the number of missing teeth, the number of composite fillings, and right- and left-hand Nine Hole Peg Test scores, and negatively correlated with the SBI. Salivary cortisol levels did not differ between groups and correlated only with the disability scale. Conclusions: MS patients require ongoing dental care and preventive measures to manage both general and oral health symptoms effectively.

23 pages, 3932 KiB  
Review
A Survey of Automatic Speech Recognition for Dysarthric Speech
by Zhaopeng Qian and Kejing Xiao
Electronics 2023, 12(20), 4278; https://doi.org/10.3390/electronics12204278 - 16 Oct 2023
Cited by 10 | Viewed by 4807
Abstract
Dysarthric speech has several pathological characteristics that distinguish it from healthy speech, such as discontinuous pronunciation, uncontrolled volume, slow speech, explosive pronunciation, improper pauses, excessive nasal sounds, and air-flow noise during pronunciation. Automatic speech recognition (ASR) can be very helpful for speakers with dysarthria. Our research aims to provide a scoping review of ASR for dysarthric speech, covering papers in this field from 1990 to 2022. Our survey found that research on the acoustic features and the acoustic models of dysarthric speech has developed nearly in parallel. During the 2010s, deep learning technologies were widely applied to improve the performance of ASR systems. In the era of deep learning, many advanced methods (such as convolutional neural networks, deep neural networks, and recurrent neural networks) are being applied to design acoustic models and lexical and language models for dysarthric-speech-recognition tasks. Deep learning methods are also used to extract acoustic features from dysarthric speech. Additionally, this scoping review found that speaker dependence seriously limits the generalization of acoustic models, and that the scarce available speech data cannot satisfy the amounts required to train models with big-data methods.
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)

17 pages, 1757 KiB  
Article
End-to-End Mispronunciation Detection and Diagnosis Using Transfer Learning
by Linkai Peng, Yingming Gao, Rian Bao, Ya Li and Jinsong Zhang
Appl. Sci. 2023, 13(11), 6793; https://doi.org/10.3390/app13116793 - 2 Jun 2023
Cited by 6 | Viewed by 3481
Abstract
As an indispensable module of computer-aided pronunciation training (CAPT) systems, mispronunciation detection and diagnosis (MDD) techniques have attracted a lot of attention from academia and industry over the past decade. Training robust MDD models requires massive human-annotated speech recordings, which are usually expensive and even hard to acquire. In this study, we propose to use transfer learning to tackle this data scarcity from two aspects. First, from the audio modality, we explore the use of the pretrained model wav2vec2.0 for MDD tasks by learning robust, general acoustic representations. Second, from the text modality, we explore transferring prior texts into MDD by learning associations between the acoustic and textual modalities. We propose textual modulation gates that assign more importance to relevant text information while suppressing irrelevant text information. Moreover, given the transcriptions, we propose an extra contrastive loss to reduce the difference between the learning objectives of the phoneme recognition and MDD tasks. Experiments on the L2-Arctic dataset showed that our wav2vec2.0-based models outperformed the conventional methods. The proposed textual modulation gate and contrastive loss further improved the F1-score by more than 2.88%, and our best model achieved an F1-score of 61.75%.
(This article belongs to the Special Issue Advances in Speech and Language Processing)
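For readers unfamiliar with the audio-modality transfer step, a minimal sketch of extracting wav2vec2.0 acoustic representations for a downstream MDD head follows; the public Hugging Face checkpoint and the file name are assumptions for illustration, not necessarily the authors' setup:

```python
# Sketch: a pretrained wav2vec2.0 model as a general acoustic encoder for MDD.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("learner_utterance.wav")  # hypothetical L2 recording
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = encoder(inputs.input_values).last_hidden_state  # (1, frames, 768)
# `features` would then feed a phoneme-recognition / MDD classification head.
```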

18 pages, 972 KiB  
Article
Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection
by Md. Anwar Hussen Wadud, Mohammed Alatiyyah and M. F. Mridha
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109 - 22 Dec 2022
Cited by 11 | Viewed by 3275
Abstract
A crucial element of computer-assisted pronunciation training (CAPT) systems is the mispronunciation detection and diagnosis (MDD) technique. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. Conventional approaches, such as forced alignment and extended recognition networks, have employed such prior texts in full, either for model development or for enhancing system performance. The incorporation of prior texts into model training has recently been attempted using end-to-end (E2E)-based approaches, and preliminary results indicate efficacy. However, attention-based end-to-end models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data will frequently impair their effectiveness in MDD. To solve this problem, we provide an MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce estimation time while maintaining accuracy levels similar to traditional E2E neural models: instead of left-to-right forward computation, NAR models accept parallel inputs and generate token sequences in parallel. To further enhance the effectiveness of MDD, we develop a pronunciation model superimposed on our approach’s NAR end-to-end models. To test our strategy against some of the best end-to-end models, we use the publicly accessible L2-ARCTIC and SpeechOcean English datasets for training and testing, where the proposed model shows better results than the other existing models.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
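A toy contrast between the two decoding regimes may help: autoregressive decoding spends one forward pass per emitted token, while a non-autoregressive model predicts every position in a single pass. The stand-in decoder below is schematic, invented for this sketch, not the paper's architecture:

```python
import torch

vocab, d = 50, 16
torch.manual_seed(0)

class ToyDecoder(torch.nn.Module):
    """Stand-in decoder: returns random logits of the requested shape."""
    def forward(self, memory, tokens=None, length=None):
        steps = length if tokens is None else tokens.shape[1]
        return torch.randn(1, steps, vocab)

model = ToyDecoder()
memory = torch.randn(1, 10, d)  # encoder output for one utterance

# Autoregressive: one forward pass per token (slow at inference time).
tokens = [1]  # BOS
for _ in range(8):
    logits = model(memory, tokens=torch.tensor([tokens]))
    tokens.append(int(logits[0, -1].argmax()))

# Non-autoregressive: a single forward pass predicts all positions at once.
nar_tokens = model(memory, length=8).argmax(dim=-1)[0].tolist()
print(tokens[1:], nar_tokens)
```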

24 pages, 3512 KiB  
Article
Mispronunciation Detection and Diagnosis with Articulatory-Level Feedback Generation for Non-Native Arabic Speech
by Mohammed Algabri, Hassan Mathkour, Mansour Alsulaiman and Mohamed A. Bencherif
Mathematics 2022, 10(15), 2727; https://doi.org/10.3390/math10152727 - 2 Aug 2022
Cited by 15 | Viewed by 3879
Abstract
A versatile, high-performance computer-assisted pronunciation training (CAPT) system that gives learners immediate feedback on whether their pronunciation is correct is very helpful in learning correct pronunciation, since it allows learners to practice at any time, with unlimited repetitions, and without the presence of an instructor. In this paper, we propose deep learning-based techniques to build such a CAPT system for mispronunciation detection and diagnosis (MDD) and articulatory feedback generation for non-native Arabic learners. The proposed system can locate the error in pronunciation, recognize the mispronounced phonemes, and detect the corresponding articulatory features (AFs), not only in words but also in sentences. We formulate the recognition of phonemes and their corresponding AFs as a multi-label object recognition problem, where the objects are the phonemes and their AFs in a spectral image. Moreover, we investigate the use of cutting-edge neural text-to-speech (TTS) technology to generate a new corpus of high-quality speech from predefined text containing the most common substitution errors among Arabic learners. The proposed model and its various enhanced versions achieved excellent results. We compared the different proposed models with the state-of-the-art end-to-end MDD technique, and our system performed better; fusing the proposed model with the end-to-end model improved performance further. Our best model achieved a 3.83% phoneme error rate (PER) in the phoneme recognition task, a 70.53% F1-score in the MDD task, and a 2.6% detection error rate (DER) in the AF detection task.
(This article belongs to the Special Issue Recent Advances in Artificial Intelligence and Machine Learning)
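As a rough illustration of the multi-label formulation (simplified here to multi-label classification over a whole spectral image, omitting the object-localization part), one sigmoid output per phoneme and per AF can be trained jointly; the label inventory sizes and the backbone below are invented for the sketch:

```python
# Sketch: joint multi-label scoring of phonemes and articulatory features.
import torch
import torch.nn as nn

N_PHONEMES, N_AFS = 40, 20              # hypothetical label inventory sizes

backbone = nn.Sequential(                # stand-in CNN over (1, H, W) spectrograms
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, N_PHONEMES + N_AFS)

x = torch.randn(8, 1, 128, 256)          # batch of spectral images
logits = head(backbone(x))               # one independent score per label
targets = torch.zeros(8, N_PHONEMES + N_AFS)  # multi-hot ground truth
loss = nn.BCEWithLogitsLoss()(logits, targets)
```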

16 pages, 25741 KiB  
Article
Speech GAU: A Single Head Attention for Mandarin Speech Recognition for Air Traffic Control
by Shiyu Zhang, Jianguo Kong, Chao Chen, Yabin Li and Haijun Liang
Aerospace 2022, 9(8), 395; https://doi.org/10.3390/aerospace9080395 - 22 Jul 2022
Cited by 11 | Viewed by 2535
Abstract
The rise of end-to-end (E2E) speech recognition technology in recent years has overturned the design pattern of cascading multiple subtasks in classical speech recognition and achieved direct mapping of speech input signals to text labels. In this study, a new E2E framework, ResNet–GAU–CTC, is proposed to implement Mandarin speech recognition for air traffic control (ATC). A deep residual network (ResNet) utilizes the translation invariance and local correlation of a convolutional neural network (CNN) to extract the time-frequency domain information of speech signals. A gated attention unit (GAU) utilizes a gated single-head attention mechanism to better capture the long-range dependencies of sequences, thus attaining a larger receptive field and contextual information, as well as a faster training convergence rate. The connectionist temporal classification (CTC) criterion eliminates the need for forced frame-level alignments. To address the problems of scarce data resources and unique pronunciation norms and contexts in the ATC field, transfer learning and data augmentation techniques were applied to enhance the robustness of the network and improve the generalization ability of the model. The character error rate (CER) of our model was 11.1% on the expanded Aishell corpus, and it decreased to 8.0% on the ATC corpus.
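The CTC criterion mentioned above trains directly on unaligned label sequences by marginalizing over all possible frame alignments. A minimal PyTorch sketch with illustrative shapes:

```python
# Minimal CTC example: no frame-level alignment of targets is required.
import torch
import torch.nn as nn

T, N, C = 50, 4, 30                     # frames, batch, classes (blank = 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 12))  # unaligned character labels
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                         # trainable without any forced alignment
```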

16 pages, 1749 KiB  
Article
Learning the Relative Dynamic Features for Word-Level Lipreading
by Hao Li, Nurbiya Yadikar, Yali Zhu, Mutallip Mamut and Kurban Ubul
Sensors 2022, 22(10), 3732; https://doi.org/10.3390/s22103732 - 13 May 2022
Cited by 2 | Viewed by 2478
Abstract
Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we can make is finite, which leads to homophone problems when speaking. On the other hand, different speakers produce different lip movements for the same word. To address these problems, this paper focuses on spatial–temporal feature extraction for word-level lipreading, and an efficient two-stream model is proposed to learn the relative dynamic information of lip motion; a sketch of the idea follows the entry below. In this model, two CNN streams with different channel capacities extract, respectively, static features within a single frame and dynamic information across multi-frame sequences. We explored a more effective convolution structure for each component in the front-end model, improving performance by about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of the two sampling methods on the fast and slow channels. Furthermore, we discuss how the front-end and back-end models are fused under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved a new state of the art.
(This article belongs to the Topic Human Movement Analysis)
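A minimal sketch of the two-stream idea under assumed parameters: a slow, higher-capacity stream sees sparsely sampled frames for static detail, while a fast, lighter stream sees every frame for lip dynamics. Channel sizes and sampling strides here are illustrative, not the paper's configuration:

```python
# Sketch: two CNN streams with different channel capacities and frame rates.
import torch
import torch.nn as nn

class TwoStreamFrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), padding=(2, 3, 3))

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        slow = self.slow(clip[:, :, ::4])    # sparse sampling: every 4th frame
        fast = self.fast(clip)               # dense sampling: all frames
        return slow, fast                    # fused later in the back end

slow, fast = TwoStreamFrontEnd()(torch.randn(2, 3, 16, 96, 96))
```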

9 pages, 329 KiB  
Article
Prevalence and Determinants of Hoarseness in School-Aged Children
by Ahmed Alrahim, Askar K. Alshaibani, Saad Algarni, Abdulmalik Alsaied, Amal A. Alghamdi, Salma Alsharhan and Mohammad Al-Bar
Int. J. Environ. Res. Public Health 2022, 19(9), 5468; https://doi.org/10.3390/ijerph19095468 - 30 Apr 2022
Cited by 5 | Viewed by 2300
Abstract
Hoarseness in school-aged children may affect their educational achievement and interfere with the development of their communication and social skills. The global prevalence of hoarseness in school-aged children ranges between 6% and 23%. To the best of our knowledge, few studies describe the prevalence or determinants of hoarseness in Saudi school-aged children. Our aim was to measure the prevalence of hoarseness among school-aged children and to identify its determinants. A cross-sectional questionnaire-based survey was used that included randomly selected primary and early-childhood schools from the private and governmental sectors in Saudi Arabia. The data were collected using a questionnaire, self-completed by the children’s parents, covering the following aspects: sociodemographic features, the health and related comorbidities of the children and their families, attendance and performance in school, the child’s voice tone, a history of frequent crying during infancy, a history of letter pronunciation problems and stuttering, the Reflux Symptom Index (RSI), and the Children’s Voice Handicap Index-10 for parents (CVHI-10-P). Determinants of hoarseness were investigated using SPSS software (version 20). The mean age of the study children (n = 428) was 9.05 years (SD = 2.15), and 69.40% were male. The rate of hoarseness among the participants was 7.5%. Hoarseness was significantly more common in children with a history of excessive infancy crying (p = 0.006), letter pronunciation issues (especially ‘R’ and ‘S’; p = 0.003), or stuttering (p = 0.004), and in those with a previous history of hoarseness (p = 0.023). In addition, having symptoms of gastroesophageal reflux increased the odds of hoarseness more than fourfold (OR = 4.77, 95% CI = 2.171–10.51). In summary, hoarseness in children may be dangerously underestimated, as it may reflect the presence of speech problems as well as laryngopharyngeal reflux (LPR). Hoarseness was assumed on the basis of parental complaints; therefore, further research with diagnoses based on clinical assessment is needed to understand the magnitude of the hoarseness problem and its consequences in children.
16 pages, 1561 KiB  
Article
An Empirical Performance Analysis of the Speak Correct Computerized Interface
by Kamal Jambi, Hassanin Al-Barhamtoshy, Wajdi Al-Jedaibi, Mohsen Rashwan and Sherif Abdou
Processes 2022, 10(3), 487; https://doi.org/10.3390/pr10030487 - 28 Feb 2022
Cited by 1 | Viewed by 2449
Abstract
The way in which people speak reveals a lot about where they are from, where they were raised, and where they have recently lived. When communicating in a foreign or second language, accents from one’s first language are likely to emerge, giving an individual a ‘strange’ accent. This is not in itself a problem, since an accent is part of one’s personality and need not be given up; it becomes a problem only when pronunciation disrupts communication between an individual and the people with whom they are speaking. Making oneself understandable is the goal of perfecting English pronunciation. Many people require their pronunciation to be accurate, such as those working in the healthcare industry, where it is critical that each term be read precisely. Speak Correct offers each of its users a service that assists them with any English pronunciation concerns that may arise. Some pronunciation improvements apply only to a specific customer’s dictionary; in some cases, however, the modifications can be applied to the standard dictionary as well, benefiting the whole customer base. Speak Correct is a computerized linguistic interface that can assist users in many different places around the world with English pronunciation issues due to Saudi or Egyptian accents. In this study, the authors carry out an empirical investigation of the Speak Correct computerized interface to assess its performance. The results of this research reveal that Speak Correct is highly effective at delivering pronunciation correction.
(This article belongs to the Special Issue Recent Advances in Machine Learning and Applications)

20 pages, 4596 KiB  
Article
Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition
by Sanghun Jeon, Ahmed Elsharkawy and Mun Sang Kim
Sensors 2022, 22(1), 72; https://doi.org/10.3390/s22010072 - 23 Dec 2021
Cited by 22 | Viewed by 6442
Abstract
In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
(This article belongs to the Special Issue Future Speech Interfaces with Sensors and Machine Intelligence)
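The reported character and word error rates are edit-distance metrics: Levenshtein distance between hypothesis and reference, normalized by reference length. A self-contained sketch (the sample phrases are invented toy strings, not the paper's data):

```python
# Self-contained CER/WER sketch via Levenshtein distance.
def levenshtein(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or (mis)match
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

print(wer("place blue in a nine", "place blue at nine"))  # toy phrases
```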

14 pages, 1685 KiB  
Article
An Improved Chinese String Comparator for Bloom Filter Based Privacy-Preserving Record Linkage
by Siqi Sun, Yining Qian, Ruoshi Zhang, Yanqi Wang and Xinran Li
Entropy 2021, 23(8), 1091; https://doi.org/10.3390/e23081091 - 22 Aug 2021
Cited by 2 | Viewed by 3152
Abstract
With the development of information technology, sharing data from multiple sources without privacy disclosure has become a popular topic. Privacy-preserving record linkage (PPRL) links records that truly match without disclosing personal information. Existing PPRL techniques have mostly been studied for alphabetic languages, which differ greatly from the Chinese language environment. In this paper, Chinese characters (the identification fields in record pairs) are encoded into strings composed of letters and numbers using the SoundShape code, according to their shapes and pronunciations. The SoundShape codes are then encrypted with Bloom filters, and the similarity of the encrypted fields is calculated with the Dice similarity. The method accounts for the false positive rate of the Bloom filter and for different proportions of sound code and shape code. Finally, we evaluated the above methods on synthetic datasets and compared the precision, recall, F1-score, and computational time across different values of the false positive rate and proportion. The results showed that our method for PPRL in the Chinese language environment improved the quality of the classification results and outperformed others at a relatively low additional computational cost.
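A minimal sketch of the encode-and-compare step under assumed parameters (the filter size, hash count, and sample codes are illustrative): each field is split into bigrams, each bigram is hashed into a bit array with several salted hashes, and two encodings are compared with the Dice coefficient 2|A∩B|/(|A|+|B|) over the set bit positions:

```python
# Sketch: Bloom-filter field encoding and Dice similarity for PPRL.
import hashlib

BITS, K = 256, 4                                # filter size and hash count

def bloom(field: str) -> set[int]:
    grams = [field[i:i + 2] for i in range(len(field) - 1)]
    positions = set()
    for g in grams:
        for k in range(K):                      # K salted hashes per bigram
            h = hashlib.sha256(f"{k}:{g}".encode()).digest()
            positions.add(int.from_bytes(h[:4], "big") % BITS)
    return positions

def dice(a: set[int], b: set[int]) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

# SoundShape codes (letters + digits) would be encoded instead of raw names;
# these two codes are made up for the example.
print(dice(bloom("zhang3wei4"), bloom("zhang3wei2")))
```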

19 pages, 3482 KiB  
Article
Interdisciplinary and Intercultural Development of an Early Literacy App in Dhuwaya
by Gillian Wigglesworth, Melanie Wilkinson, Yalmay Yunupingu, Robyn Beecham and Jake Stockley
Languages 2021, 6(2), 106; https://doi.org/10.3390/languages6020106 - 15 Jun 2021
Cited by 1 | Viewed by 3858
Abstract
Phonological awareness is a skill which is crucial in learning to read. In this paper, we report on the challenges encountered while developing a digital application (app) for teaching phonological awareness and early literacy skills in Dhuwaya. Dhuwaya is a Yolŋu language variety spoken in Yirrkala and surrounding areas in East Arnhem Land. Dhuwaya is the first language of the children who attend a bilingual school in which Dhuwaya and English are the languages of instruction. Dhuwaya and English have different phonemic inventories and different alphabets. The Dhuwaya alphabet is based on Roman alphabet symbols and has 31 graphemes (compared to 26 in English). The app was designed to teach children how to segment and blend syllables and phonemes and to identify common words as well as suffixes used in the language. However, the development was not straightforward, and the impact of the linguistic, cultural and educational challenges could not have been predicted. Amongst these was the inherent variation in the language, including glottal stops, the pronunciation of stops, the focus on syllables as a decoding strategy for literacy development, and the challenge of finding one-syllable words such as those initially used with English-speaking children. Another challenge was identifying culturally appropriate images which the children could relate to and which were not copyrighted. In this paper, we discuss these and a range of other issues that emerged, identifying how these problems were addressed and resolved by the interdisciplinary and intercultural team.
(This article belongs to the Special Issue Australian Languages Today)

19 pages, 1416 KiB  
Article
Correct Pronunciation Detection of the Arabic Alphabet Using Deep Learning
by Nishmia Ziafat, Hafiz Farooq Ahmad, Iram Fatima, Muhammad Zia, Abdulaziz Alhumam and Kashif Rajpoot
Appl. Sci. 2021, 11(6), 2508; https://doi.org/10.3390/app11062508 - 11 Mar 2021
Cited by 14 | Viewed by 8382
Abstract
Automatic speech recognition for Arabic has its unique challenges, and progress in this domain has been relatively slow; Classical Arabic, specifically, has received even less research attention. The correct pronunciation of the Arabic alphabet has significant implications for the meaning of words. In this work, we design learning models that classify the Arabic alphabet based on the correct pronunciation of each letter, a challenging task for the research community. We divide the problem into two steps: first, we train a model to recognize a letter (Arabic alphabet classification); second, we train a model to determine the quality of its pronunciation (Arabic alphabet pronunciation classification). Because little audio data of this kind is available, we collected audio data from experts and novices to train our models. To train these models, we extract pronunciation features from the audio data using mel-spectrograms. We employ a deep convolutional neural network (DCNN), AlexNet with transfer learning, and a bidirectional long short-term memory (BLSTM) network, a type of recurrent neural network (RNN), for the classification of the audio data. For alphabet classification, the DCNN, AlexNet, and BLSTM achieve accuracies of 95.95%, 98.41%, and 88.32%, respectively. For pronunciation classification, they achieve accuracies of 97.88%, 99.14%, and 77.71%, respectively.
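The feature extraction step the authors describe is standard mel-spectrogram computation. A minimal librosa sketch, with a hypothetical file name and illustrative parameter values rather than the authors' settings:

```python
# Sketch: log-mel-spectrogram features for a pronunciation classifier.
import librosa
import numpy as np

y, sr = librosa.load("alif_expert_01.wav", sr=16_000)  # hypothetical recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)  # (64, frames), fed to DCNN/BLSTM
```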

22 pages, 1420 KiB  
Article
Speech Processing for Language Learning: A Practical Approach to Computer-Assisted Pronunciation Teaching
by Natalia Bogach, Elena Boitsova, Sergey Chernonog, Anton Lamtev, Maria Lesnichaya, Iurii Lezhenin, Andrey Novopashenny, Roman Svechnikov, Daria Tsikach, Konstantin Vasiliev, Evgeny Pyshkin and John Blake
Electronics 2021, 10(3), 235; https://doi.org/10.3390/electronics10030235 - 20 Jan 2021
Cited by 36 | Viewed by 6434
Abstract
This article contributes to the discourse on how contemporary computer and information technology may help improve foreign language learning, not only by supporting a better and more flexible workflow and digitizing study materials, but also by creating completely new use cases made possible by advances in signal processing algorithms. We discuss an approach and propose a holistic solution for teaching the phonological phenomena that are crucial for correct pronunciation: the phonemes; the energy and duration of syllables and pauses, which construct the phrasal rhythm; and the tone movement within an utterance, i.e., the phrasal intonation. The working prototype of the StudyIntonation computer-assisted pronunciation training (CAPT) system is a tool for mobile devices which offers a set of tasks based on a “listen and repeat” approach and gives audio-visual feedback in real time. The present work summarizes the efforts taken to enrich the current version of this CAPT tool with two new functions: phonetic transcription and rhythmic patterns of model and learner speech. Both are built on the third-party automatic speech recognition (ASR) library Kaldi, which was incorporated into the StudyIntonation signal processing core. We also examine the scope of ASR applicability within the CAPT system workflow and evaluate the Levenshtein distance between transcriptions made by human experts and those obtained automatically by our code. We developed an algorithm for rhythm reconstruction using acoustic and language ASR models. We also show that even learners with sufficiently correct phoneme production often fail to produce a correct phrasal rhythm and intonation; therefore, joint training of sounds, rhythm and intonation within a single learning environment is beneficial. To mitigate recording imperfections, voice activity detection (VAD) is applied to all processed speech recordings. The try-outs showed that StudyIntonation can create transcriptions and process rhythmic patterns, although some specific problems with connected-speech transcription were detected. The learner feedback for pronunciation assessment was also updated: a conventional mechanism based on dynamic time warping (DTW) was combined with a cross-recurrence quantification analysis (CRQA) approach, which resulted in better discriminating ability, and the CRQA metrics combined with those of DTW were shown to add to the accuracy of learner performance estimation. The major implications for computer-assisted English pronunciation teaching are discussed.
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
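Of the assessment mechanisms mentioned, DTW is the most self-contained to sketch: it aligns two contours of different lengths by dynamic programming and returns an accumulated distance. The toy pitch contours below are invented for illustration (the CRQA part is omitted):

```python
# Self-contained DTW sketch for comparing model and learner pitch contours.
import numpy as np

def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])            # local pitch difference
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]                                     # smaller = closer prosody

model_f0 = np.array([120, 150, 180, 160, 130], float)     # toy contours (Hz)
learner_f0 = np.array([118, 140, 185, 150, 128, 125], float)
print(dtw(model_f0, learner_f0))
```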
