Search Results (524)

Search Parameters:
Keywords = speech recognition system

16 pages, 1227 KB  
Article
Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues
by Daniel Grabowski, Kamila Łuczaj and Khalid Saeed
Sensors 2025, 25(19), 6086; https://doi.org/10.3390/s25196086 - 2 Oct 2025
Abstract
Advances in multimodal artificial intelligence enable new sensor-inspired approaches to lie detection by combining behavioral perception with generative reasoning. This study presents a deception detection framework that integrates deep video and audio processing with large language models guided by chain-of-thought (CoT) prompting. We interpret neural architectures such as ViViT (for video) and HuBERT (for speech) as digital behavioral sensors that extract implicit emotional and cognitive cues, including micro-expressions, vocal stress, and timing irregularities. We further incorporate a GPT-5-based prompt-level fusion approach for video–language–emotion alignment and zero-shot inference. This method jointly processes visual frames, textual transcripts, and emotion recognition outputs, enabling the system to generate interpretable deception hypotheses without any task-specific fine-tuning. Facial expressions are treated as high-resolution affective signals captured via visual sensors, while audio encodes prosodic markers of stress. Our experimental setup is based on the DOLOS dataset, which provides high-quality multimodal recordings of deceptive and truthful behavior. We also evaluate a continual learning setup that transfers emotional understanding to deception classification. Results indicate that multimodal fusion and CoT-based reasoning increase classification accuracy and interpretability. The proposed system bridges the gap between raw behavioral data and semantic inference, laying a foundation for AI-driven lie detection with interpretable sensor analogues. Full article
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)
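The prompt-level fusion described above combines visual frames, transcripts, and emotion-recognition outputs into a single chain-of-thought query. A minimal sketch of how such a fusion prompt might be assembled (the field names and cue lists are illustrative assumptions, not the authors' schema):

```python
# Illustrative sketch: assemble a chain-of-thought fusion prompt from
# per-modality outputs. Field names and cue wording are assumptions.
from dataclasses import dataclass

@dataclass
class ClipEvidence:
    transcript: str            # ASR output for the clip
    vocal_emotions: list       # e.g. labels from a speech-emotion model
    facial_cues: list          # e.g. micro-expression tags from a video model

def build_cot_prompt(ev: ClipEvidence) -> str:
    return (
        "You are analysing an interview clip for possible deception.\n"
        f"Transcript: {ev.transcript}\n"
        f"Vocal emotion cues: {', '.join(ev.vocal_emotions)}\n"
        f"Facial cues: {', '.join(ev.facial_cues)}\n"
        "Reason step by step about consistency between what is said and how it is said, "
        "then answer 'truthful' or 'deceptive' with a short justification."
    )

prompt = build_cot_prompt(ClipEvidence(
    transcript="I was at home the whole evening.",
    vocal_emotions=["stress", "hesitation"],
    facial_cues=["brief lip press", "gaze aversion"],
))
print(prompt)  # the assembled prompt would then be sent to the reasoning model
```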
43 pages, 3035 KB  
Article
Real-Time Recognition of NZ Sign Language Alphabets by Optimal Use of Machine Learning
by Seyed Ebrahim Hosseini, Mubashir Ali, Shahbaz Pervez and Muneer Ahmad
Bioengineering 2025, 12(10), 1068; https://doi.org/10.3390/bioengineering12101068 - 30 Sep 2025
Abstract
The acquisition of a person’s first language is one of their greatest accomplishments. Nevertheless, being fluent in sign language presents challenges for many deaf students who rely on it for communication. Effective communication is essential for both personal and professional interactions and is critical for community engagement. However, the lack of a mutually understood language can be a significant barrier. Estimates indicate that a large portion of New Zealand’s disability population is deaf, with an educational approach predominantly focused on oralism, emphasizing spoken language. This makes it essential to bridge the communication gap between the general public and individuals with speech difficulties. The aim of this project is to develop an application that systematically cycles through each letter and number in New Zealand Sign Language (NZSL), assessing the user’s proficiency. This research investigates various machine learning methods for hand gesture recognition, with a focus on landmark detection. In computer vision, identifying specific points on an object—such as distinct hand landmarks—is a standard approach for feature extraction. Evaluation of this system has been performed using machine learning techniques, including Random Forest (RF) Classifier, k-Nearest Neighbours (KNN), AdaBoost (AB), Naïve Bayes (NB), Support Vector Machine (SVM), Decision Trees (DT), and Logistic Regression (LR). The dataset used for model training and testing consists of approximately 100,000 hand gesture expressions, formatted as a CSV dataset. Full article
(This article belongs to the Special Issue AI and Data Science in Bioengineering: Innovations and Applications)
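A minimal sketch of the landmark-feature classification step with scikit-learn; the CSV file name and column layout are hypothetical:

```python
# Minimal sketch: train classic ML classifiers on hand-landmark features.
# Assumes a CSV where each row holds flattened landmark coordinates plus a label;
# the file name and column names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("nzsl_landmarks.csv")          # hypothetical dataset file
X, y = df.drop(columns=["label"]).values, df["label"].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

for name, clf in {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}.items():
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```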
17 pages, 2436 KB  
Article
Deep Learning System for Speech Command Recognition
by Dejan Vujičić, Đorđe Damnjanović, Dušan Marković and Zoran Stamenković
Electronics 2025, 14(19), 3793; https://doi.org/10.3390/electronics14193793 - 24 Sep 2025
Viewed by 13
Abstract
We present a deep learning model for the recognition of speech commands in the English language. The dataset is based on the Google Speech Commands Dataset by Warden P., version 0.01, and it consists of ten distinct commands (“left”, “right”, “go”, “stop”, “up”, “down”, “on”, “off”, “yes”, and “no”) along with additional “silence” and “unknown” classes. The dataset is split in a speaker-independent manner, with 70% of speakers assigned to the training set and 15% each to the validation and test sets. All audio clips are sampled at 16 kHz, with a total of 46,146 clips. Audio files are converted into Mel spectrogram representations, which are then used as input to a deep learning model composed of a four-layer convolutional neural network followed by two fully connected layers. The model employs Rectified Linear Unit (ReLU) activation, the Adam optimizer, and dropout regularization to improve generalization. The achieved testing accuracy is 96.05%. Micro- and macro-averaged precision, recall, and F1-scores of 95% are reported to reflect class-wise performance, and a confusion matrix is also provided. The proposed model has been deployed on a Raspberry Pi 5 as a Fog computing device for real-time speech recognition applications. Full article
(This article belongs to the Special Issue Data-Centric Artificial Intelligence: New Methods for Data Processing)
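The abstract specifies a four-convolution plus two fully connected layer network over Mel spectrograms with ReLU, Adam, and dropout. A sketch of such an architecture in PyTorch (channel counts, kernel sizes, and the input shape are assumptions):

```python
# Sketch of a four-convolution + two-FC classifier over Mel spectrograms,
# mirroring the architecture described in the abstract (layer sizes are assumptions).
import torch
import torch.nn as nn

class SpeechCommandNet(nn.Module):
    def __init__(self, n_classes: int = 12):   # 10 commands + silence + unknown
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):          # x: (batch, 1, n_mels, time)
        return self.classifier(self.features(x))

model = SpeechCommandNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as in the paper
logits = model(torch.randn(8, 1, 64, 101))                  # dummy batch of Mel spectrograms
print(logits.shape)                                         # torch.Size([8, 12])
```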
31 pages, 3671 KB  
Article
Research on Wu Dialect Recognition and Regional Variations Based on Deep Learning
by Xinyi Yue, Lizhi Miao and Jiahao Ding
Appl. Sci. 2025, 15(18), 10227; https://doi.org/10.3390/app151810227 - 19 Sep 2025
Viewed by 259
Abstract
Wu dialects carry deep regional culture, but their significant internal variation makes automated recognition challenging. This study focuses on speech recognition and semantic feedback for Wu dialects, proposing a deep learning system with regional adaptability and semantic feedback capabilities. First, a corpus covering multiple Wu dialect regions (WXDPC) is constructed, and a two-level phoneme mapping and regional difference modeling mechanism is introduced. By incorporating geographical region labels and transfer learning, the model’s performance in non-central regions is improved. Experimental results show that as the training corpus increases, the model’s character error rate (CER) decreases significantly. After the introduction of regional labels, the CER in non-central Wu dialect regions decreased by 4.5%, demonstrating the model’s effectiveness in complex dialect environments. This system provides technical support for the preservation and application of Wu dialects and offers valuable experience for the promotion of other dialect recognition systems. Full article
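The reported gains are measured in character error rate (CER); a minimal way to compute CER with the jiwer library (the transcript pairs are placeholders):

```python
# Character error rate between reference and hypothesis transcripts,
# computed with the jiwer library; the example strings are placeholders.
import jiwer

refs = ["今朝天气老好格", "侬吃饭了伐"]
hyps = ["今朝天气老好", "侬吃饭了"]
print(f"CER = {jiwer.cer(refs, hyps):.3f}")
```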
11 pages, 1005 KB  
Proceeding Paper
Multimodal Fusion for Enhanced Human–Computer Interaction
by Ajay Sharma, Isha Batra, Shamneesh Sharma and Anggy Pradiftha Junfithrana
Eng. Proc. 2025, 107(1), 81; https://doi.org/10.3390/engproc2025107081 - 10 Sep 2025
Viewed by 362
Abstract
Our paper introduces a virtual mouse driven by gesture detection, eye-tracking, and voice recognition. This system uses cutting-edge computer vision and machine learning technology to let users command and control the mouse pointer using eye motions, voice commands, or hand gestures. The system’s main goal is to provide an easy and engaging interface for users who want a more natural, hands-free approach to interacting with their computers, as well as for users with impairments that limit their bodily motions, such as paralysis. The system improves accessibility and usability by combining many input modalities, therefore providing a flexible answer for numerous users. While the speech recognition function permits hands-free operation via voice instructions, the eye-tracking component detects and responds to the user’s gaze, therefore providing exact cursor control. Gesture recognition enhances these features even further by letting users use their hands simply to execute mouse operations. This technology not only enhances personal user experience for people with impairments but also marks a major development in human–computer interaction. It shows how computer vision and machine learning may be used to provide more inclusive and flexible user interfaces, therefore improving the accessibility and efficiency of computer usage for everyone. Full article
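As a toy illustration of the eye-tracking-to-cursor mapping described above, a normalized gaze estimate can be scaled to screen coordinates (the gaze point is a hard-coded placeholder; a real system would stream it from a tracker):

```python
# Toy sketch: map a normalized gaze estimate to a cursor position with pyautogui.
# The gaze coordinates below are placeholders standing in for eye-tracker output.
import pyautogui

screen_w, screen_h = pyautogui.size()
gaze_x, gaze_y = 0.62, 0.35          # normalized gaze coordinates (assumed input)
pyautogui.moveTo(int(gaze_x * screen_w), int(gaze_y * screen_h), duration=0.1)
```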
22 pages, 4234 KB  
Article
Speaker Recognition Based on the Combination of SincNet and Neuro-Fuzzy for Intelligent Home Service Robots
by Seo-Hyun Kim, Tae-Wan Kim and Keun-Chang Kwak
Electronics 2025, 14(18), 3581; https://doi.org/10.3390/electronics14183581 - 9 Sep 2025
Viewed by 391
Abstract
Speaker recognition has become a critical component of human–robot interaction (HRI), enabling personalized services based on user identity, as the demand for home service robots increases. In contrast to conventional speech recognition tasks, recognition in home service robot environments is affected by varying speaker–robot distances and background noises, which can significantly reduce accuracy. Traditional approaches rely on hand-crafted features such as mel-frequency cepstral coefficients (MFCCs), which may lose essential speaker-specific information during extraction. To address this, we propose a novel speaker recognition technique for intelligent robots that combines SincNet-based raw waveform processing with an adaptive neuro-fuzzy inference system (ANFIS). SincNet extracts relevant frequency features by learning low- and high-cutoff frequencies in its convolutional filters, reducing parameter complexity while retaining discriminative power. To improve interpretability and handle non-linearity, ANFIS is used as the classifier, leveraging fuzzy rules generated by fuzzy c-means (FCM) clustering. The model is evaluated on a custom dataset collected in a realistic home environment with background noise, including TV sounds and mechanical noise from robot motion. Our results show that the proposed model outperforms existing CNN, CNN-ANFIS, and SincNet models in terms of accuracy. This approach offers robust performance and enhanced model transparency, making it well-suited for intelligent home robot systems. Full article
(This article belongs to the Special Issue Control and Design of Intelligent Robots)
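SincNet learns low- and high-cutoff frequencies for band-pass filters applied directly to raw waveforms. A didactic reconstruction of one such band-pass filter follows; the fixed cutoffs stand in for the learnable parameters, and this is not the authors' implementation:

```python
# Simplified sketch of a SincNet-style band-pass filter: the difference of two
# windowed sinc low-pass filters. In the full model the cutoff frequencies are
# learnable; here they are fixed for illustration.
import torch

def sinc_bandpass(f1_hz, f2_hz, kernel_size=251, sr=16000):
    n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
    t = n / sr                                         # time axis in seconds

    def lowpass(fc):
        return 2 * fc * torch.sinc(2 * fc * t)         # torch.sinc is sin(pi x)/(pi x)

    band = lowpass(torch.tensor(float(f2_hz))) - lowpass(torch.tensor(float(f1_hz)))
    band = band * torch.hamming_window(kernel_size)    # smooth the truncation
    return band / band.abs().max()

filt = sinc_bandpass(300.0, 3400.0)                    # e.g. a telephone-band filter
waveform = torch.randn(1, 1, 16000)                    # 1 s of dummy audio at 16 kHz
out = torch.nn.functional.conv1d(waveform, filt.view(1, 1, -1), padding=filt.numel() // 2)
print(out.shape)
```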
10 pages, 2364 KB  
Proceeding Paper
AI-Powered Sign Language Detection Using YOLO-v11 for Communication Equality
by Ivana Lucia Kharisma, Irma Nurmalasari, Yuni Lestari, Salma Dela Septiani, Kamdan and Muchtar Ali Setyo Yudono
Eng. Proc. 2025, 107(1), 83; https://doi.org/10.3390/engproc2025107083 - 8 Sep 2025
Viewed by 314
Abstract
Communication plays a vital role in conveying messages, expressing emotions, and sharing perceptions, becoming a fundamental aspect of human interaction with the environment. For individuals with hearing impairments, sign language serves as an essential communication tool, enabling interaction both within the deaf community and with non-deaf individuals. This study aims to bridge this communication gap by developing a sign language recognition system using the deep learning-based YOLO-v11 algorithm. YOLO-v11, a state-of-the-art object detection algorithm, is known for its speed, accuracy, and efficiency. The system uses image recognition to identify hand gestures in ASL and translates them into text or speech, facilitating inclusive communication. The training accuracy is 94.67% and the testing accuracy is 93.02%, indicating that the model performs very well in recognizing sign language on the training and testing datasets. Additionally, the model is very reliable in recognizing the classes “Hello”, “I Love You”, “No”, and “Thank You”, with a sensitivity close to or equal to 100%. This research contributes to advancing communication equality for individuals with hearing impairments, promoting inclusivity, and supporting their integration into society. Full article
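A hedged sketch of training an Ultralytics YOLO11 detector on a hand-sign dataset; the dataset YAML, sample image, and training settings are assumptions, not the paper's setup:

```python
# Hedged sketch of fine-tuning an Ultralytics YOLO11 detector on a hand-sign
# dataset; dataset config, image path, and settings are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                                   # pretrained nano checkpoint
model.train(data="asl_signs.yaml", epochs=50, imgsz=640)     # hypothetical dataset config
results = model.predict("hello_gesture.jpg")                 # run detection on a sample image
print(results[0].boxes.cls, results[0].boxes.conf)           # predicted classes and confidences
```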
18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Viewed by 535
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder—which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space—the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions. Full article
(This article belongs to the Section Computer)
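The phoneme-aware SpecAugment variant confines masks within phoneme boundaries. A small numpy sketch under assumed boundaries (in practice these would come from a forced aligner such as the Montreal Forced Aligner):

```python
# Illustrative numpy sketch of phoneme-aware time masking: masks are confined to
# a single phoneme's frame span instead of straddling boundaries. The boundaries
# below are made up for the example.
import numpy as np

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 300))                         # (mel bins, frames)
phoneme_spans = [(0, 40), (40, 95), (95, 180), (180, 300)]    # frame boundaries (assumed)

def phoneme_time_mask(spec, spans, max_frac=0.5):
    out = spec.copy()
    start, end = spans[rng.integers(len(spans))]              # pick one phoneme segment
    width = rng.integers(1, max(2, int((end - start) * max_frac)))
    t0 = rng.integers(start, end - width + 1)
    out[:, t0:t0 + width] = 0.0                               # zero-mask inside the phoneme only
    return out

augmented = phoneme_time_mask(spec, phoneme_spans)
```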
21 pages, 471 KB  
Review
Long Short-Term Memory Networks: A Comprehensive Survey
by Moez Krichen and Alaeddine Mihoub
AI 2025, 6(9), 215; https://doi.org/10.3390/ai6090215 - 5 Sep 2025
Viewed by 1129
Abstract
Long Short-Term Memory (LSTM) networks have revolutionized the field of deep learning, particularly in applications that require the modeling of sequential data. Originally designed to overcome the limitations of traditional recurrent neural networks (RNNs), LSTMs effectively capture long-range dependencies in sequences, making them suitable for a wide array of tasks. This survey aims to provide a comprehensive overview of LSTM architectures, detailing their unique components, such as cell states and gating mechanisms, which facilitate the retention and modulation of information over time. We delve into the various applications of LSTMs across multiple domains, including the following: natural language processing (NLP), where they are employed for language modeling, machine translation, and sentiment analysis; time series analysis, where they play a critical role in forecasting tasks; and speech recognition, significantly enhancing the accuracy of automated systems. By examining these applications, we illustrate the versatility and robustness of LSTMs in handling complex data types. Additionally, we explore several notable variants and improvements of the standard LSTM architecture, such as Bidirectional LSTMs, which enhance context understanding, and Stacked LSTMs, which increase model capacity. We also discuss the integration of attention mechanisms with LSTMs, which have further advanced their performance in various tasks. Despite their strengths, LSTMs face several challenges, including high computational complexity, extensive data requirements, and difficulties in training, which can hinder their practical implementation. This survey addresses these limitations and provides insights into ongoing research aimed at mitigating these issues. In conclusion, we highlight recent advances in LSTM research and propose potential future directions that could lead to enhanced performance and broader applicability of LSTM networks. This survey serves as a foundational resource for researchers and practitioners seeking to understand the current landscape of LSTM technology and its future trajectory. Full article
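A minimal PyTorch example of the kind of LSTM-based sequence classifier the survey covers; dimensions and the bidirectional choice are illustrative:

```python
# Minimal sketch of an LSTM-based sequence classifier; sizes are illustrative.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=128, n_classes=5, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=bidirectional)
        self.head = nn.Linear(hidden_dim * (2 if bidirectional else 1), n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # classify from the last time step

model = LSTMClassifier()
print(model(torch.randn(4, 100, 40)).shape)   # torch.Size([4, 5])
```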
25 pages, 4433 KB  
Article
Mathematical Analysis and Performance Evaluation of CBAM-DenseNet121 for Speech Emotion Recognition Using the CREMA-D Dataset
by Zineddine Sarhani Kahhoul, Nadjiba Terki, Ilyes Benaissa, Khaled Aldwoah, E. I. Hassan, Osman Osman and Djamel Eddine Boukhari
Appl. Sci. 2025, 15(17), 9692; https://doi.org/10.3390/app15179692 - 3 Sep 2025
Viewed by 536
Abstract
Emotion recognition from speech is essential for human–computer interaction (HCI) and affective computing, with applications in virtual assistants, healthcare, and education. Although deep learning has made significant advances in Automatic Speech Emotion Recognition (ASER), the task remains challenging given speaker variation, subtle emotional expressions, and environmental noise. Practical deployment in this context depends on a strong, fast, scalable recognition system. This work introduces a new framework combining DenseNet121, especially fine-tuned for the crowd-sourced emotional multimodal actors dataset (CREMA-D), with the convolutional block attention module (CBAM). While DenseNet121’s effective feature propagation captures rich, hierarchical patterns in the speech data, CBAM improves the focus of the model on emotionally significant elements by applying both spatial and channel-wise attention. Furthermore, an advanced preprocessing pipeline, including log-Mel spectrogram transformation and normalization, enhances the input spectrograms and strengthens resistance to environmental noise. The proposed model demonstrates superior performance. To ensure a robust evaluation under class imbalance, we report an Unweighted Average Recall (UAR) of 71.01% and an F1 score of 71.25%; the model also achieves a test accuracy of 71.26% and a precision of 71.30%. These results establish the model as a promising solution for real-world speech emotion detection, highlighting its strong generalization capabilities, computational efficiency, and focus on emotion-specific features compared to recent work. The improvements demonstrate practical flexibility, enabling the integration of established image recognition techniques and allowing for substantial adaptability in various application contexts. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
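The preprocessing pipeline centers on log-Mel spectrograms with normalization. A sketch of that step with librosa (file name and parameter values are assumptions):

```python
# Sketch of log-Mel preprocessing of the kind described in the abstract,
# using librosa; the file name and parameter values are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("crema_d_sample.wav", sr=16000)            # hypothetical clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=256)
log_mel = librosa.power_to_db(mel, ref=np.max)                  # log-Mel spectrogram
log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)   # per-utterance normalization
print(log_mel.shape)
```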
20 pages, 2732 KB  
Article
Redesigning Multimodal Interaction: Adaptive Signal Processing and Cross-Modal Interaction for Hands-Free Computer Interaction
by Bui Hong Quan, Nguyen Dinh Tuan Anh, Hoang Van Phi and Bui Trung Thanh
Sensors 2025, 25(17), 5411; https://doi.org/10.3390/s25175411 - 2 Sep 2025
Viewed by 575
Abstract
Hands-free computer interaction is a key topic in assistive technology, with camera-based and voice-based systems being the most common methods. Recent camera-based solutions leverage facial expressions or head movements to simulate mouse clicks or key presses, while voice-based systems enable control via speech commands, wake-word detection, and vocal gestures. However, existing systems often suffer from limitations in responsiveness and accuracy, especially under real-world conditions. In this paper, we present 3-Modal Human-Computer Interaction (3M-HCI), a novel interaction system that dynamically integrates facial, vocal, and eye-based inputs through a new signal processing pipeline and a cross-modal coordination mechanism. This approach not only enhances recognition accuracy but also reduces interaction latency. Experimental results demonstrate that 3M-HCI outperforms several recent hands-free interaction solutions in both speed and precision, highlighting its potential as a robust assistive interface. Full article
(This article belongs to the Section Sensing and Imaging)
12 pages, 304 KB  
Article
LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(17), 5404; https://doi.org/10.3390/s25175404 - 1 Sep 2025
Viewed by 887
Abstract
To address the triple bottlenecks of data scarcity, oversized models, and slow inference that hinder Cantonese automatic speech recognition (ASR) in low-resource and edge-deployment settings, this study proposes a cost-effective Cantonese ASR system based on LoRA fine-tuning and INT8 quantization. First, Whisper-tiny is parameter-efficiently fine-tuned on the Common Voice zh-HK training set using LoRA with rank = 8. Only 1.6% of the original weights are updated, reducing the character error rate (CER) from 49.5% to 11.1%, a performance close to full fine-tuning (10.3%), while cutting the training memory footprint and computational cost by approximately one order of magnitude. Next, the fine-tuned model is compressed into a 60 MB INT8 checkpoint via dynamic quantization in ONNX Runtime. On a MacBook Pro M1 Max CPU, the quantized model achieves a real-time factor (RTF) of 0.20 (offline inference at 5× real time) and 43% lower latency than the FP16 baseline; on an NVIDIA A10 GPU, it reaches an RTF of 0.06, meeting the requirements of high-concurrency cloud services. Ablation studies confirm that the LoRA-INT8 configuration offers the best trade-off among accuracy, speed, and model size. Limitations include the absence of spontaneous-speech noise data, extreme-hardware validation, and adaptive LoRA structure optimization. Future work will incorporate large-scale self-supervised pre-training, tone-aware loss functions, AdaLoRA architecture search, and INT4/NPU quantization, and will establish an mJ/char energy–accuracy curve. The ultimate goal is to achieve CER ≤ 8%, RTF < 0.1, and mJ/char < 1 for low-power real-time Cantonese ASR in practical IoT scenarios. Full article
(This article belongs to the Section Electronic Sensors)
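A hedged sketch of the LoRA (rank 8) fine-tuning step on Whisper-tiny using the Hugging Face PEFT library; the target module names and hyperparameters are assumptions rather than the paper's exact configuration:

```python
# Hedged sketch: attach rank-8 LoRA adapters to Whisper-tiny with PEFT.
# Target modules and hyperparameters are assumptions, not the paper's setup.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])   # assumed attention projections
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only a small fraction of weights are trainable
# After fine-tuning, the merged model could be exported to ONNX and compressed with
# onnxruntime.quantization.quantize_dynamic for INT8 edge deployment.
```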
23 pages, 1233 KB  
Article
Decoding the Digits: How Number Notation Influences Cognitive Effort and Performance in Chinese-to-English Sight Translation
by Xueyan Zong, Lei Song and Shanshan Yang
Behav. Sci. 2025, 15(9), 1195; https://doi.org/10.3390/bs15091195 - 1 Sep 2025
Viewed by 474
Abstract
Numbers present persistent challenges in interpreting, yet cognitive mechanisms underlying notation-specific processing remain underexplored. While eye-tracking studies in visually-assisted simultaneous interpreting have advanced number research, they predominantly examine Arabic numerals in non-Chinese contexts—neglecting notation diversity increasingly prevalent in computer-assisted interpreting systems where Automatic Speech Recognition outputs vary across languages. Addressing these gaps, this study investigated how number notation (Arabic digits vs. Chinese character numbers) affects trainee interpreters’ cognitive effort and performance in Chinese-to-English sight translation. Employing a mixed-methods design, we measured global (task-level) and local (number-specific) eye movements alongside expert assessments, output analysis, and subjective assessments. Results show that Chinese character numbers demand significantly greater cognitive effort than Arabic digits, evidenced by more and longer fixations, more extensive saccadic movements, and a larger eye-voice span. Concurrently, sight translation quality decreased markedly with Chinese character numbers, with more processing attempts yet lower accuracy and fluency. Subjective workload ratings confirmed higher mental, physical, and temporal demands in Task 2. These findings reveal an effort-quality paradox where greater cognitive investment in processing complex notations leads to poorer outcomes, and highlight the urgent need for notation-specific training strategies and adaptive technologies in multilingual communication. Full article
(This article belongs to the Section Cognition)
15 pages, 1780 KB  
Article
Prosodic Spatio-Temporal Feature Fusion with Attention Mechanisms for Speech Emotion Recognition
by Kristiawan Nugroho, Imam Husni Al Amin, Nina Anggraeni Noviasari and De Rosal Ignatius Moses Setiadi
Computers 2025, 14(9), 361; https://doi.org/10.3390/computers14090361 - 31 Aug 2025
Viewed by 659
Abstract
Speech Emotion Recognition (SER) plays a vital role in supporting applications such as healthcare, human–computer interaction, and security. However, many existing approaches still face challenges in achieving robust generalization and maintaining high recall, particularly for emotions related to stress and anxiety. This study proposes a dual-stream hybrid model that combines prosodic features with spatio-temporal representations derived from the Multitaper Mel-Frequency Spectrogram (MTMFS) and the Constant-Q Transform Spectrogram (CQTS). Prosodic cues, including pitch, intensity, jitter, shimmer, HNR, pause rate, and speech rate, were processed using dense layers, while MTMFS and CQTS features were encoded with CNN and BiGRU. A Multi-Head Attention mechanism was then applied to adaptively fuse the two feature streams, allowing the model to focus on the most relevant emotional cues. Evaluations conducted on the RAVDESS dataset with subject-independent 5-fold cross-validation demonstrated an accuracy of 97.64% and a macro F1-score of 0.9745. These results confirm that combining prosodic and advanced spectrogram features with attention-based fusion improves precision, recall, and overall robustness, offering a promising framework for more reliable SER systems. Full article
(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (2nd Edition))
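A didactic sketch of attention-based fusion of a prosodic stream with a spectrogram-derived sequence, in the spirit of the dual-stream design above; dimensions are illustrative, not the paper's:

```python
# Didactic sketch: fuse a pooled prosodic token with an encoded spectrogram
# sequence via multi-head attention; all dimensions are illustrative.
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
prosodic = torch.randn(8, 1, d_model)       # one pooled prosody token per utterance
spectro_seq = torch.randn(8, 50, d_model)   # stand-in for CNN/BiGRU-encoded spectrogram frames

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
fused, weights = attn(query=prosodic, key=spectro_seq, value=spectro_seq)
logits = nn.Linear(d_model, 8)(fused.squeeze(1))   # 8 emotion classes (as in RAVDESS)
print(logits.shape)                                # torch.Size([8, 8])
```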
15 pages, 252 KB  
Article
Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model
by Mohammad Alshboul, Abdul Rahman Al Muaitah, Suhad Al-Issa and Mahmoud Al-Ayyoub
Appl. Sci. 2025, 15(17), 9521; https://doi.org/10.3390/app15179521 - 29 Aug 2025
Viewed by 680
Abstract
In this work, we build on our recent efforts toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to people of any age, gender, or expertise level. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark of audio recordings made by male and female reciters from various age groups and competence levels, was introduced in our prior work. We used various subsets of the QRFAM dataset for training, validation, and testing to build several baseline NSR systems based on Mozilla’s DeepSpeech model, and we have also presented our efforts to optimize and enhance these baselines. In this study, we expand this line of work by utilizing the well-known Whisper speech recognition model and describe the effect of this choice on the model’s accuracy, expressed as the word error rate (WER), in comparison to that of DeepSpeech. Full article
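A minimal sketch of the evaluation loop implied by the abstract: transcribe a clip with a Whisper checkpoint and score it against a reference with WER. The file name, checkpoint size, and reference text are placeholders:

```python
# Minimal sketch: transcribe a recitation clip with a Whisper checkpoint via the
# transformers pipeline and score it with word error rate; inputs are placeholders.
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
hypothesis = asr("recitation_sample.wav")["text"]
reference = "بسم الله الرحمن الرحيم"          # placeholder reference transcript
print(f"WER = {jiwer.wer(reference, hypothesis):.3f}")
```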