Search Results (296)

Search Parameters:
Keywords = automatic speech recognition

24 pages, 3568 KB  
Article
Employing AI for Better Access to Justice: An Automatic Text-to-Video Linking Tool for UK Supreme Court Hearings
by Hadeel Saadany, Constantin Orăsan, Catherine Breslin, Mikolaj Barczentewicz and Sophie Walker
Appl. Sci. 2025, 15(16), 9205; https://doi.org/10.3390/app15169205 - 21 Aug 2025
Viewed by 307
Abstract
The increasing adoption of artificial intelligence across domains presents new opportunities to enhance access to justice. In this paper, we introduce a human-centric AI tool that utilises advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to facilitate semantic linking between written UK Supreme Court (SC) judgements and their corresponding hearing videos. The motivation stems from the critical role UK SC hearings play in shaping landmark legal decisions, which often span several hours and remain difficult to navigate manually. Our approach involves two key components: (1) a customised ASR system fine-tuned on 139 h of manually edited SC hearing transcripts and legal documents and (2) a semantic linking module powered by GPT-based text embeddings adapted to the legal domain. The ASR system addresses domain-specific transcription challenges by incorporating a custom language model and legal phrase extraction techniques. The semantic linking module uses fine-tuned embeddings to match judgement paragraphs with relevant spans in the hearing transcripts. Quantitative evaluation shows that our customised ASR system improves transcription accuracy by 9% compared to generic ASR baselines. Furthermore, our adapted GPT embeddings achieve an F1 score of 0.85 in classifying relevant links between judgement text and hearing transcript segments. These results demonstrate the effectiveness of our system in streamlining access to critical legal information and supporting legal professionals in interpreting complex judicial decisions. Full article
(This article belongs to the Special Issue Computational Linguistics: From Text to Speech Technologies)
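As a rough illustration of the semantic-linking step described above (not the authors' pipeline), the sketch below embeds judgement paragraphs and transcript segments and links them by cosine similarity. The sentence-transformers model, the example texts, and the 0.5 threshold are stand-ins for the paper's fine-tuned GPT embeddings and learned decision rule.

```python
# Minimal sketch of embedding-based judgement-to-transcript linking.
# Stand-in: sentence-transformers instead of the paper's fine-tuned GPT embeddings;
# the 0.5 threshold and example texts are illustrative, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

judgement_paragraphs = [
    "The appellant contends that the tribunal erred in law ...",
    "Costs are awarded to the respondent ...",
]
transcript_segments = [
    "My Lords, the question of law before the court today is ...",
    "Turning now to the matter of costs ...",
]

# Encode both sides and L2-normalise so a dot product equals cosine similarity.
para_vecs = np.asarray(model.encode(judgement_paragraphs, normalize_embeddings=True))
seg_vecs = np.asarray(model.encode(transcript_segments, normalize_embeddings=True))

similarity = para_vecs @ seg_vecs.T  # (paragraphs x segments)

THRESHOLD = 0.5  # illustrative cut-off for declaring a "relevant" link
for i, row in enumerate(similarity):
    j = int(row.argmax())
    if row[j] >= THRESHOLD:
        print(f"Paragraph {i} -> transcript segment {j} (cosine {row[j]:.2f})")
```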

38 pages, 3579 KB  
Systematic Review
Integrating Artificial Intelligence and Extended Reality in Language Education: A Systematic Literature Review (2017–2024)
by Weijian Yan, Belle Li and Victoria L. Lowell
Educ. Sci. 2025, 15(8), 1066; https://doi.org/10.3390/educsci15081066 - 19 Aug 2025
Viewed by 673
Abstract
This systematic literature review examines the integration of Artificial Intelligence (AI) and Extended Reality (XR) technologies in language education, synthesizing findings from 32 empirical studies published between 2017 and 2024. Guided by the PRISMA framework, we searched four databases—ERIC, Web of Science, Scopus, and IEEE Xplore—to identify studies that explicitly integrated both AI and XR to support language learning. The review explores publication trends, educational settings, target languages, language skills, learning outcomes, and theoretical frameworks; analyzes how AI–XR technologies have been pedagogically integrated; and identifies affordances, challenges, design considerations, and future directions of AI–XR integration. Key integration strategies include coupling XR environments with AI technologies such as automatic speech recognition, natural language processing, computer vision, and conversational agents to support skills like speaking, vocabulary, writing, and intercultural competence. The reported affordances pertain to technical, pedagogical, and affective dimensions. However, challenges persist in terms of technical limitations, pedagogical constraints, scalability and generalizability, ethical and human-centered concerns, and infrastructure and cost barriers. Design recommendations and future directions emphasize the need for adaptive AI dialogue systems, broader pedagogical applications, longitudinal studies, learner-centered interaction, scalable and accessible design, and evaluation. This review offers a comprehensive synthesis to guide researchers, educators, and developers in designing effective AI–XR language learning experiences. Full article
(This article belongs to the Section Technology Enhanced Education)

31 pages, 5187 KB  
Article
Investigation of ASR Models for Low-Resource Kazakh Child Speech: Corpus Development, Model Adaptation, and Evaluation
by Diana Rakhimova, Zhansaya Duisenbekkyzy and Eşref Adali
Appl. Sci. 2025, 15(16), 8989; https://doi.org/10.3390/app15168989 - 14 Aug 2025
Viewed by 201
Abstract
This study focuses on the development and evaluation of automatic speech recognition (ASR) systems for Kazakh child speech, an underexplored domain in both linguistic and computational research. A specialized acoustic corpus was constructed for children aged 2 to 8 years, incorporating age-related vocabulary stratification and gender variation to capture phonetic and prosodic diversity. The data were collected from three sources: a custom-designed Telegram bot, high-quality Dictaphone recordings, and naturalistic speech samples recorded in home and preschool environments. Four ASR models, Whisper, DeepSpeech, ESPnet, and Vosk, were evaluated. Whisper, ESPnet, and DeepSpeech were fine-tuned on the curated corpus, while Vosk was applied in its standard pretrained configuration. Performance was measured using five evaluation metrics: Word Error Rate (WER), BLEU, Translation Edit Rate (TER), Character Similarity Rate (CSRF2), and Accuracy. The results indicate that ESPnet achieved the highest accuracy (32%) and the lowest WER (0.242) for sentences, while Whisper performed well in semantically rich utterances (Accuracy = 33%; WER = 0.416). Vosk demonstrated the best performance on short words (Accuracy = 68%) and yielded the highest BLEU score (0.600) for short words. DeepSpeech showed moderate improvements in accuracy, particularly for short words (Accuracy = 60%), but faced challenges with longer utterances, achieving an Accuracy of 25% for sentences. These findings emphasize the critical importance of age-appropriate corpora and domain-specific adaptation when developing ASR systems for low-resource child speech, particularly in educational and therapeutic contexts. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
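The abstract reports WER as its headline metric; the following minimal sketch shows the standard word-level Levenshtein computation behind that number (the example strings are invented, not from the Kazakh corpus).

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over word tokens,
# normalised by reference length -- the metric reported in the abstract.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Toy usage (made-up strings, not from the corpus):
print(wer("the cat sat on the mat", "the cat sat mat"))  # two deletions -> 0.333...
```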

24 pages, 5649 KB  
Article
Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion
by Md. Shahid Ahammed Shakil, Fahmid Al Farid, Nitun Kumar Podder, S. M. Hasan Sazzad Iqbal, Abu Saleh Musa Miah, Md Abdur Rahim and Hezerul Abdul Karim
J. Imaging 2025, 11(8), 273; https://doi.org/10.3390/jimaging11080273 - 14 Aug 2025
Viewed by 329
Abstract
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Subsequently, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
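The handcrafted 1D feature set listed above can be reproduced approximately with librosa, as in the hedged sketch below; the mean-over-time pooling, the 20 MFCC coefficients, and the soft-voting toy arrays are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the handcrafted 1-D feature vector described in the abstract, using librosa.
# Aggregation (mean over time frames) and the file name are illustrative assumptions.
import numpy as np
import librosa

def handcrafted_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
        librosa.feature.rms(y=y),
        librosa.feature.melspectrogram(y=y, sr=sr),
    ]
    # Mean-pool each (bins x frames) matrix over time and concatenate into one 1-D vector.
    return np.concatenate([f.mean(axis=1) for f in feats])

# Soft voting over the three streams' class-probability outputs (toy random arrays):
p_cnn, p_cnn_lstm, p_cnn_bilstm = (np.random.dirichlet(np.ones(7), 1) for _ in range(3))
final_pred = np.mean([p_cnn, p_cnn_lstm, p_cnn_bilstm], axis=0).argmax(axis=1)
```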

10 pages, 724 KB  
Article
Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP
by Stefano Di Leo, Luca De Cicco and Saverio Mascolo
Information 2025, 16(8), 685; https://doi.org/10.3390/info16080685 - 11 Aug 2025
Viewed by 845
Abstract
This paper presents a real-time speech-to-text (STT) system designed for edge computing environments requiring ultra-low latency and local processing. Unlike cloud-based STT services, the proposed solution runs entirely on local infrastructure, which preserves user privacy and provides high performance in bandwidth-limited or offline scenarios. The system is based on browser-native audio capture through WebRTC, real-time streaming over WebSocket, and offline automatic speech recognition (ASR) using the Vosk engine. A natural language processing (NLP) component, implemented as a microservice, improves the spelling accuracy and clarity of the transcriptions. Our prototype achieves sub-second end-to-end latency and strong transcription performance under realistic conditions. Furthermore, the modular architecture allows extensibility, integration of advanced AI models, and domain-specific adaptation. Full article
(This article belongs to the Section Information Applications)
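For the recognition stage only, a minimal offline Vosk loop looks roughly like the sketch below; the WebRTC capture, WebSocket transport, and NLP microservice from the paper are out of scope here, and the model path and audio file are assumptions.

```python
# Minimal offline Vosk sketch of the recognition step of such a pipeline.
# The model directory and input file are assumptions; point them at any unpacked Vosk model
# and a 16 kHz mono PCM WAV file.
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")                       # path to an unpacked Vosk model (assumption)
wf = wave.open("utterance.wav", "rb")        # mono PCM input (assumption)
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    chunk = wf.readframes(4000)              # feed small chunks, as a streaming server would
    if not chunk:
        break
    if rec.AcceptWaveform(chunk):            # True when an utterance-final result is ready
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])  # flush the last partial hypothesis
```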

15 pages, 856 KB  
Article
Automated Assessment of Word- and Sentence-Level Speech Intelligibility in Developmental Motor Speech Disorders: A Cross-Linguistic Investigation
by Micalle Carl and Michal Icht
Diagnostics 2025, 15(15), 1892; https://doi.org/10.3390/diagnostics15151892 - 28 Jul 2025
Viewed by 295
Abstract
Background/Objectives: Accurate assessment of speech intelligibility is necessary for individuals with motor speech disorders. Transcription or scaled rating methods by naïve listeners are the most reliable tasks for these purposes; however, they are often resource-intensive and time-consuming within clinical contexts. Automatic speech recognition (ASR) systems, which transcribe speech into text, have been increasingly utilized for assessing speech intelligibility. This study investigates the feasibility of using an open-source ASR system to assess speech intelligibility in Hebrew and English speakers with Down syndrome (DS). Methods: Recordings from 65 Hebrew- and English-speaking participants were included: 33 speakers with DS and 32 typically developing (TD) peers. Speech samples (words, sentences) were transcribed using Whisper (OpenAI) and by naïve listeners. The proportion of agreement between ASR transcriptions and those of naïve listeners was compared across speaker groups (TD, DS) and languages (Hebrew, English) for word-level data. Further comparisons for Hebrew speakers were conducted across speaker groups and stimuli (words, sentences). Results: The strength of the correlation between listener and ASR transcription scores varied across languages, and was higher for English (r = 0.98) than for Hebrew (r = 0.81) for speakers with DS. A higher proportion of listener–ASR agreement was demonstrated for TD speakers, as compared to those with DS (0.94 vs. 0.74, respectively), and for English, in comparison to Hebrew speakers (0.91 for English DS speakers vs. 0.74 for Hebrew DS speakers). Listener–ASR agreement for single words was consistently higher than for sentences among Hebrew speakers. Speakers’ intelligibility influenced word-level agreement among Hebrew- but not English-speaking participants with DS. Conclusions: ASR performance for English closely approximated that of naïve listeners, suggesting potential near-future clinical applicability within single-word intelligibility assessment. In contrast, a lower proportion of agreement between human listeners and ASR for Hebrew speech indicates that broader clinical implementation may require further training of ASR models in this language. Full article
(This article belongs to the Special Issue Evaluation and Management of Developmental Disabilities)
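A simple way to approximate the reported listener-ASR comparison is sketched below, assuming exact word match per item and per-speaker score lists; the numbers are hypothetical, not the study's data.

```python
# Illustrative sketch of word-level listener-ASR agreement and a per-speaker correlation.
# Exact string match per single-word item is an assumption, not necessarily the study's scheme.
from scipy.stats import pearsonr

def proportion_agreement(asr_words, listener_words):
    """Fraction of single-word items where the ASR and listener transcriptions match."""
    matches = sum(a.strip().lower() == b.strip().lower()
                  for a, b in zip(asr_words, listener_words))
    return matches / len(asr_words)

# Hypothetical per-speaker intelligibility scores from listeners vs. ASR:
listener_scores = [0.92, 0.61, 0.78, 0.55, 0.88]
asr_scores = [0.90, 0.58, 0.80, 0.49, 0.86]
r, p = pearsonr(listener_scores, asr_scores)
print(f"agreement example: {proportion_agreement(['mat', 'dog'], ['mat', 'bog']):.2f}, r = {r:.2f}")
```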

15 pages, 1359 KB  
Article
Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(14), 4288; https://doi.org/10.3390/s25144288 - 9 Jul 2025
Viewed by 539
Abstract
Cantonese Automatic Speech Recognition (ASR) is hindered by tonal complexity, acoustic diversity, and a lack of labelled data. This study proposes a phoneme-aware hierarchical augmentation framework that enhances performance without additional annotation. A Phoneme Substitution Matrix (PSM), built from Montreal Forced Aligner alignments and Tacotron-2 synthesis, injects adversarial phoneme variants into both transcripts and their aligned audio segments, enlarging pronunciation diversity. Concurrently, a semantic-aware SpecAugment scheme exploits wav2vec 2.0 attention heat maps and keyword boundaries to adaptively mask informative time–frequency regions; a reinforcement-learning controller tunes the masking schedule online, forcing the model to rely on a wider context. On the Common Voice Cantonese 50 h subset, the combined strategy reduces the character error rate (CER) from 26.17% to 16.88% with wav2vec 2.0 and from 38.83% to 23.55% with Zipformer. At 100 h, the CER further drops to 4.27% and 2.32%, yielding relative gains of 32–44%. Ablation studies confirm that phoneme-level and masking components provide complementary benefits. The framework offers a practical, model-independent path toward accurate ASR for Cantonese and other low-resource tonal languages. This paper presents an intelligent sensing-oriented modeling framework for speech signals, which is suitable for deployment on edge or embedded systems to process input from audio sensors (e.g., microphones) and shows promising potential for voice-interactive terminal applications. Full article
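A simplified, attention-guided SpecAugment step in the spirit of the abstract might look like the sketch below; the fixed mask sizes replace the paper's reinforcement-learning controller, and the spectrogram and attention profile are random toy arrays.

```python
# Simplified sketch of attention-guided SpecAugment: time masks are sampled where an
# attention/saliency profile is high, so informative regions get hidden and the model must
# rely on wider context. The RL-tuned schedule from the paper is replaced by fixed sizes.
import numpy as np

def semantic_specaugment(spec: np.ndarray, attention: np.ndarray,
                         n_time_masks: int = 2, t_width: int = 20,
                         n_freq_masks: int = 1, f_width: int = 15) -> np.ndarray:
    """spec: (freq_bins, frames); attention: (frames,) non-negative importance profile."""
    out = spec.copy()
    probs = attention / attention.sum()
    for _ in range(n_time_masks):
        centre = np.random.choice(len(probs), p=probs)   # bias masks toward salient frames
        lo = max(0, centre - t_width // 2)
        out[:, lo:lo + t_width] = 0.0
    for _ in range(n_freq_masks):
        f0 = np.random.randint(0, max(1, spec.shape[0] - f_width))
        out[f0:f0 + f_width, :] = 0.0                    # plain frequency mask
    return out

spec = np.random.rand(80, 300)                           # toy log-Mel spectrogram
attn = np.random.rand(300)                               # toy wav2vec 2.0 attention profile
masked = semantic_specaugment(spec, attn)
```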

19 pages, 2212 KB  
Article
A Self-Evaluated Bilingual Automatic Speech Recognition System for Mandarin–English Mixed Conversations
by Xinhe Hai, Kaviya Aranganadin, Cheng-Cheng Yeh, Zhengmao Hua, Chen-Yun Huang, Hua-Yi Hsu and Ming-Chieh Lin
Appl. Sci. 2025, 15(14), 7691; https://doi.org/10.3390/app15147691 - 9 Jul 2025
Viewed by 754
Abstract
Bilingual communication is increasingly prevalent in this globally connected world, where cultural exchanges and international interactions are unavoidable. Existing automatic speech recognition (ASR) systems are often limited to single languages. However, the growing demand for bilingual ASR in human–computer interactions, particularly in medical services, has become indispensable. This article addresses this need by creating an application programming interface (API)-based platform using VOSK, a popular open-source single-language ASR toolkit, to efficiently deploy a self-evaluated bilingual ASR system that seamlessly handles both primary and secondary languages in tasks like Mandarin–English mixed-speech recognition. The mixed error rate (MER) is used as a performance metric, and a workflow is outlined for its calculation using the edit distance algorithm. Results show a remarkable reduction in the Mandarin–English MER, dropping from ∼65% to under 13%, after implementing the self-evaluation framework and mixed-language algorithms. These findings highlight the importance of a well-designed system to manage the complexities of mixed-language speech recognition, offering a promising method for building a bilingual ASR system using existing monolingual models. The framework might be further extended to a trilingual or multilingual ASR system by preparing mixed-language datasets and computer development without involving complex training. Full article
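The mixed error rate (MER) idea, i.e., scoring Mandarin per character and English per word before taking an edit distance, can be sketched as below; the tokenizer regex and the example utterance are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the mixed error rate (MER): Mandarin is scored per character and English per word,
# then a standard edit distance is taken over the mixed token sequence.
import re

def mixed_tokens(text: str):
    # Each CJK character is one token; contiguous Latin letters/digits form one token.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text)

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Toy mixed-language reference/hypothesis pair (illustrative, not from the paper's data):
ref = mixed_tokens("我 今天 要做 MRI 检查")
hyp = mixed_tokens("我今天要做 MRA 检查")
mer = edit_distance(ref, hyp) / len(ref)
print(f"MER = {mer:.2f}")
```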

22 pages, 4293 KB  
Article
Speech-Based Parkinson’s Detection Using Pre-Trained Self-Supervised Automatic Speech Recognition (ASR) Models and Supervised Contrastive Learning
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
Bioengineering 2025, 12(7), 728; https://doi.org/10.3390/bioengineering12070728 - 1 Jul 2025
Viewed by 1169
Abstract
Diagnosing Parkinson’s disease (PD) through speech analysis is a promising area of research, as speech impairments are often one of the early signs of the disease. This study investigates the efficacy of fine-tuning pre-trained Automatic Speech Recognition (ASR) models, specifically Wav2Vec 2.0 and HuBERT, for PD detection using transfer learning. These models, pre-trained on large unlabeled datasets, are capable of learning rich speech representations that capture acoustic markers of PD. The study also proposes the integration of a supervised contrastive (SupCon) learning approach to enhance the models’ ability to distinguish PD-specific features. Additionally, the proposed ASR-based features were compared against two common acoustic feature sets: mel-frequency cepstral coefficients (MFCCs) and the extended Geneva minimalistic acoustic parameter set (eGeMAPS) as a baseline. We also employed a gradient-based method, Grad-CAM, to visualize important speech regions contributing to the models’ predictions. The experiments, conducted using the NeuroVoz dataset, demonstrated that features extracted from the pre-trained ASR models exhibited superior performance compared to the baseline features. The results also reveal that the method integrating SupCon consistently outperforms traditional cross-entropy (CE)-based models. Wav2Vec 2.0 and HuBERT with SupCon achieved the highest F1 scores of 90.0% and 88.99%, respectively. Additionally, their AUC scores in the ROC analysis surpassed those of the CE models, which had comparatively lower AUCs, ranging from 0.84 to 0.89. These results highlight the potential of ASR-based models as scalable, non-invasive tools for diagnosing and monitoring PD, offering a promising avenue for the early detection and management of this debilitating condition. Full article
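A minimal single-view version of the supervised contrastive (SupCon) objective mentioned above is sketched below in PyTorch; the temperature, batch, and embedding dimensions are illustrative, and the Wav2Vec 2.0/HuBERT encoders are not reproduced.

```python
# Minimal PyTorch sketch of a supervised contrastive (SupCon) objective that pulls
# same-class speech embeddings together; single-view batch version, temperature is illustrative.
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """features: (batch, dim) embeddings; labels: (batch,) class ids (e.g., PD vs. healthy)."""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                              # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                                   # skip anchors with no positive
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    loss = -pos_log_prob[valid] / pos_counts[valid]
    return loss.mean()

emb = torch.randn(8, 256)                  # e.g., pooled Wav2Vec 2.0 / HuBERT embeddings
lbl = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
print(supcon_loss(emb, lbl))
```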

15 pages, 847 KB  
Data Descriptor
Mixtec–Spanish Parallel Text Dataset for Language Technology Development
by Hermilo Santiago-Benito, Diana-Margarita Córdova-Esparza, Juan Terven, Noé-Alejandro Castro-Sánchez, Teresa García-Ramirez, Julio-Alejandro Romero-González and José M. Álvarez-Alvarado
Data 2025, 10(7), 94; https://doi.org/10.3390/data10070094 - 21 Jun 2025
Viewed by 596
Abstract
This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachistlahuaca, Northern Guerrero, and Xochapa) and Oaxaca (Western Coast, Southern Lowland, Santa María Yosoyúa, Central, Lower Cañada, Western Central, San Antonio Huitepec, Upper Western, and Southwestern Central). Texts are classified into four main domains as follows: education, law, health, and religion. To compile these data, we conducted a two-phase collection process as follows: first, an online search of government portals, religious organizations, and Mixtec language blogs; and second, an on-site retrieval of physical texts from the library of the Autonomous University of Querétaro. Scanning and optical character recognition were then performed to digitize physical materials, followed by manual correction to fix character misreadings and remove duplicates or irrelevant segments. We conducted a preliminary evaluation of the collected data to validate its usability in automatic translation systems. From Spanish to Mixtec, a fine-tuned GPT-4o-mini model yielded a BLEU score of 0.22 and a TER score of 122.86, while two fine-tuned open source models mBART-50 and M2M-100 yielded BLEU scores of 4.2 and 2.63 and TER scores of 98.99 and 104.87, respectively. All code demonstrating data usage, along with the final corpus itself, is publicly accessible via GitHub and Figshare. We anticipate that this resource will enable further research into machine translation, speech recognition, and other NLP applications while contributing to the broader goal of preserving and revitalizing the Mixtec language. Full article
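The BLEU/TER evaluation can be reproduced in outline with sacrebleu (assuming version 2.x), as sketched below; the reference and hypothesis strings are placeholders, and the fine-tuned translation models are not shown.

```python
# Hedged sketch of the BLEU/TER evaluation loop, assuming sacrebleu 2.x.
# The sentences below are placeholders, not corpus data.
from sacrebleu.metrics import BLEU, TER

references = ["target sentence one", "target sentence two"]   # gold Mixtec translations
hypotheses = ["target sentence one", "target phrase two"]     # model outputs for the same sources

bleu = BLEU().corpus_score(hypotheses, [references])
ter = TER().corpus_score(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")      # sacrebleu reports 0-100 scales
```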

6 pages, 175 KB  
Proceeding Paper
Comparative Analysis of Energy Consumption and Carbon Footprint in Automatic Speech Recognition Systems: A Case Study Comparing Whisper and Google Speech-to-Text
by Jalal El Bahri, Mohamed Kouissi and Mohammed Achkari Begdouri
Comput. Sci. Math. Forum 2025, 10(1), 6; https://doi.org/10.3390/cmsf2025010006 - 16 Jun 2025
Viewed by 432
Abstract
This study investigates the energy consumption and carbon footprint of two prominent automatic speech recognition (ASR) systems: OpenAI’s Whisper and Google’s Speech-to-Text API. We evaluate both local and cloud-based speech recognition approaches using a public Kaggle dataset of 20,000 short audio clips in Urdu, utilizing CodeCarbon, PyJoule, and PowerAPI for comprehensive energy profiling. Our analysis reveals substantial differences between the two systems in terms of energy efficiency and carbon emissions, with the cloud-based solution showing substantially lower environmental impact despite comparable accuracy. We discuss the implications of these findings for sustainable AI deployment and for minimizing the ecological footprint of speech recognition technologies. Full article
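A minimal CodeCarbon measurement loop of the kind implied above is sketched below; the transcribe() function and clip paths are placeholders for whichever ASR system (local Whisper or the cloud API client) is being profiled.

```python
# Minimal sketch of per-run energy/CO2 tracking with CodeCarbon.
# transcribe() and the clip paths are placeholders for the ASR system under test.
from codecarbon import EmissionsTracker

def transcribe(path: str) -> str:            # placeholder for Whisper or a cloud API call
    return "..."

tracker = EmissionsTracker(project_name="asr-energy-comparison")
tracker.start()
for clip in ["clip_0001.wav", "clip_0002.wav"]:   # iterate over the audio clips (paths assumed)
    transcribe(clip)
emissions_kg = tracker.stop()                # estimated kg CO2-eq for the measured block
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```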
24 pages, 1461 KB  
Article
Syllable-, Bigram-, and Morphology-Driven Pseudoword Generation in Greek
by Kosmas Kosmidis, Vassiliki Apostolouda and Anthi Revithiadou
Appl. Sci. 2025, 15(12), 6582; https://doi.org/10.3390/app15126582 - 11 Jun 2025
Viewed by 523
Abstract
Pseudowords are essential in (psycho)linguistic research, offering a way to study language without meaning interference. Various methods for creating pseudowords exist, but each has its limitations. Traditional approaches modify existing words, risking unintended recognition. Modern algorithmic methods use high-frequency n-grams or syllable deconstruction but often require specialized expertise. Currently, no automatic process for pseudoword generation is designed explicitly for Greek, which is our primary focus. Therefore, we developed SyBig-r-Morph, a novel application that constructs pseudowords using syllables as the main building block, replicating Greek phonotactic patterns. SyBig-r-Morph draws input from word lists and databases that include syllabification, word length, part of speech, and frequency information. It categorizes syllables by position to ensure phonotactic consistency with user-selected morphosyntactic categories and can optionally assign stress to generated words. Additionally, the tool uses multiple lexicons to eliminate phonologically invalid combinations. Its modular architecture allows easy adaptation to other languages. To further evaluate its output, we conducted a manual assessment using a tool that verifies phonotactic well-formedness based on phonological parameters derived from a corpus. Most SyBig-r-Morph words passed the stricter phonotactic criteria, confirming the tool’s sound design and linguistic adequacy. Full article
(This article belongs to the Special Issue Computational Linguistics: From Text to Speech Technologies)
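A toy position-aware syllable recombiner in the spirit of SyBig-r-Morph is sketched below; the syllable inventory, lexicon, and the absence of stress assignment are simplifying assumptions, not the tool's data or logic.

```python
# Toy sketch of position-aware syllable recombination: syllables are binned by word position,
# sampled per slot, and candidates already attested in the lexicon are rejected.
# The inventory and lexicon below are small illustrative samples, not SyBig-r-Morph's data.
import random

onset_sylls = ["ka", "pe", "lo", "mi"]      # syllables attested word-initially
middle_sylls = ["ra", "ti", "no"]           # syllables attested word-medially
final_sylls = ["mos", "ki", "la"]           # syllables attested word-finally
lexicon = {"karamos", "petino"}             # real words to exclude

def pseudoword(n_syllables: int = 3) -> str:
    while True:
        middle = [random.choice(middle_sylls) for _ in range(n_syllables - 2)]
        word = random.choice(onset_sylls) + "".join(middle) + random.choice(final_sylls)
        if word not in lexicon:             # keep only forms absent from the lexicon
            return word

print([pseudoword() for _ in range(5)])
```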

17 pages, 439 KB  
Article
MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
by Shad Torrie, Kimi Wright and Dah-Jye Lee
Electronics 2025, 14(12), 2310; https://doi.org/10.3390/electronics14122310 - 6 Jun 2025
Viewed by 1070
Abstract
Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present a new multi-task audio–visual speech recognition, or MultiAVSR, framework for training a model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of that required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it exhibits robust generalization capabilities, achieving a remarkable 44.7% word error rate on the WildVSR dataset. Our framework also demonstrates reduced dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, our framework improves robustness under various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These improvements across the three tasks illustrate the robust results our supervised multi-task speech recognition framework enables. Full article
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)
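The hybrid CTC/attention objective named in the abstract can be written, in outline, as a weighted sum of a CTC loss and a decoder cross-entropy, as in the PyTorch sketch below; the 0.3 weight and tensor shapes are illustrative assumptions, and the MultiAVSR encoders/decoder are not reproduced.

```python
# Sketch of a hybrid CTC/attention objective: a weighted sum of a CTC loss over frame-level
# outputs and a cross-entropy (attention decoder) loss. Shapes and the 0.3 weight are illustrative.
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
                              dec_logits, dec_targets, ctc_weight: float = 0.3):
    """ctc_log_probs: (T, batch, vocab) log-probabilities; dec_logits: (batch, L, vocab)."""
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens, blank=0)
    att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets, ignore_index=-100)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att

# Toy shapes: 50 encoder frames, batch of 2, vocab of 30 tokens, target length 10.
log_probs = torch.randn(50, 2, 30).log_softmax(-1)
targets = torch.randint(1, 30, (2, 10))
loss = hybrid_ctc_attention_loss(log_probs, targets, torch.full((2,), 50), torch.full((2,), 10),
                                 torch.randn(2, 10, 30), targets)
print(loss)
```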

22 pages, 305 KB  
Review
Review of Automatic Estimation of Emotions in Speech
by Douglas O’Shaughnessy
Appl. Sci. 2025, 15(10), 5731; https://doi.org/10.3390/app15105731 - 20 May 2025
Cited by 1 | Viewed by 601
Abstract
Identification of emotions exhibited in utterances is useful for many applications, e.g., assisting with handling telephone calls or psychological diagnoses. This paper reviews methods to identify emotions from speech signals. We examine the information in speech that helps to estimate emotion, from points of view involving both production and perception. As machine approaches to recognizing emotion in speech often have much in common with other speech tasks, such as automatic speaker verification and speech recognition, we compare such processes. Many methods of emotion recognition have been adopted from research on pattern recognition in other areas, e.g., image and text recognition, especially recent machine learning methods. We show that speech is very different from most other signals that can be recognized, and that emotion identification differs from other speech applications. This review is primarily aimed at non-experts (more algorithmic detail is present in the cited literature), but the presentation also offers substantial discussion for experts. Full article
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
21 pages, 4777 KB  
Article
Harnessing Semantic and Trajectory Analysis for Real-Time Pedestrian Panic Detection in Crowded Micro-Road Networks
by Rongyong Zhao, Lingchen Han, Yuxin Cai, Bingyu Wei, Arifur Rahman, Cuiling Li and Yunlong Ma
Appl. Sci. 2025, 15(10), 5394; https://doi.org/10.3390/app15105394 - 12 May 2025
Viewed by 463
Abstract
Pedestrian panic behavior is a primary cause of overcrowding and stampede accidents in public micro-road network areas with high pedestrian density. However, reliably detecting such behaviors remains challenging due to their inherent complexity, variability, and stochastic nature. Current detection models often rely on single-modality features, which limits their effectiveness in complex and dynamic crowd scenarios. To overcome these limitations, this study proposes a contour-driven multimodal framework that first employs a CNN (CDNet) to estimate density maps and, by analyzing steep contour gradients, automatically delineates a candidate panic zone. Within these potential panic zones, pedestrian trajectories are analyzed through LSTM networks to capture irregular movements, such as counterflow and nonlinear wandering behaviors. Concurrently, semantic recognition based on Transformer models is utilized to identify verbal distress cues extracted through Baidu AI’s real-time speech-to-text conversion. The three embeddings are fused through a lightweight attention-enhanced MLP, enabling end-to-end inference at 40 FPS on a single GPU. To evaluate branch robustness under streaming conditions, the UCF Crowd dataset (150 videos without panic labels) is processed frame-by-frame at 25 FPS solely for density assessment, whereas full panic detection is validated on 30 real Itaewon-Stampede videos and 160 SUMO/Unity simulated emergencies that include explicit panic annotations. The proposed system achieves 91.7% accuracy and 88.2% F1 on the Itaewon set, outperforming all single- or dual-modality baselines and offering a deployable solution for proactive crowd safety monitoring in transport hubs, festivals, and other high-risk venues. Full article
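An attention-enhanced MLP fusion stage of the kind described above might be sketched as follows in PyTorch; the embedding dimension, class count, and branch outputs are assumptions, and the CDNet, LSTM, and Transformer branches are not reproduced.

```python
# Illustrative sketch of an attention-enhanced MLP fusion stage: density, trajectory, and
# semantic embeddings are weighted by learned attention scores and passed to a small classifier.
import torch
import torch.nn as nn

class AttentionFusionMLP(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)                          # one attention score per modality
        self.classifier = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))    # panic vs. no-panic

    def forward(self, density, trajectory, semantic):
        x = torch.stack([density, trajectory, semantic], dim=1)  # (batch, 3, dim)
        weights = torch.softmax(self.score(x), dim=1)            # (batch, 3, 1)
        fused = (weights * x).sum(dim=1)                          # attention-weighted sum
        return self.classifier(fused)

model = AttentionFusionMLP()
logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)   # torch.Size([4, 2])
```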
