Search Results (84)

Search Parameters:
Keywords = voice metrics

32 pages, 852 KB  
Article
Benchmarking the Responsiveness of Open-Source Text-to-Speech Systems
by Ha Pham Thien Dinh, Rutherford Agbeshi Patamia, Ming Liu and Akansel Cosgun
Computers 2025, 14(10), 406; https://doi.org/10.3390/computers14100406 - 23 Sep 2025
Viewed by 134
Abstract
Responsiveness—the speed at which a text-to-speech (TTS) system produces audible output—is critical for real-time voice assistants yet has received far less attention than perceptual quality metrics. Existing evaluations often touch on latency but do not establish reproducible, open-source standards that capture responsiveness as a first-class dimension. This work introduces a baseline benchmark designed to fill that gap. Our framework unifies latency distribution, tail latency, and intelligibility within a transparent and dataset-diverse pipeline, enabling a fair and replicable comparison across 13 widely used open-source TTS models. By grounding evaluation in structured input sets ranging from single words to sentence-length utterances and adopting a methodology inspired by standardized inference benchmarks, we capture both typical and worst-case user experiences. Unlike prior studies that emphasize closed or proprietary systems, our focus is on establishing open, reproducible baselines rather than ranking against commercial references. The results reveal substantial variability across architectures, with some models delivering near-instant responses while others fail to meet interactive thresholds. By centering evaluation on responsiveness and reproducibility, this study provides an infrastructural foundation for benchmarking TTS systems and lays the groundwork for more comprehensive assessments that integrate both fidelity and speed.
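
Responsiveness benchmarks of this kind reduce to timing the gap between submitting text and receiving the first audio chunk, then summarizing the latency distribution and its tail. A minimal sketch of such a measurement loop follows; the streaming `synthesize` generator and the prompt set are illustrative stand-ins, not the paper's actual harness:

```python
import time
import statistics

def measure_first_audio_latency(synthesize, prompts, runs=5):
    """Time-to-first-audio for each prompt, repeated to smooth variance."""
    latencies = []
    for text in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            next(synthesize(text))  # pull the first audio chunk from a streaming generator
            latencies.append(time.perf_counter() - start)
    return latencies

def fake_tts(text):
    # Stand-in for a streaming TTS engine: yields audio chunks after some work.
    time.sleep(0.01 * len(text.split()))
    yield b"\x00" * 1024

prompts = ["hi", "turn on the lights", "what is the weather like in Melbourne today"]
lat = measure_first_audio_latency(fake_tts, prompts)
print(f"median: {statistics.median(lat) * 1000:.1f} ms")
print(f"p95 (tail latency): {statistics.quantiles(lat, n=20)[-1] * 1000:.1f} ms")
```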

25 pages, 4385 KB  
Article
Robust DeepFake Audio Detection via an Improved NeXt-TDNN with Multi-Fused Self-Supervised Learning Features
by Gul Tahaoglu
Appl. Sci. 2025, 15(17), 9685; https://doi.org/10.3390/app15179685 - 3 Sep 2025
Viewed by 1000
Abstract
Deepfake audio refers to speech that has been synthetically generated or altered through advanced neural network techniques, often with a degree of realism sufficient to convincingly imitate genuine human voices. As these manipulations become increasingly indistinguishable from authentic recordings, they present significant threats to security, undermine media integrity, and challenge the reliability of digital authentication systems. In this study, a robust detection framework is proposed, which leverages the power of self-supervised learning (SSL) and attention-based modeling to identify deepfake audio samples. Specifically, audio features are extracted from input speech using two powerful pretrained SSL models: HuBERT-Large and WavLM-Large. These distinctive features are then integrated through an Attentional Multi-Feature Fusion (AMFF) mechanism. The fused features are subsequently classified using a NeXt-Time Delay Neural Network (NeXt-TDNN) model enhanced with Efficient Channel Attention (ECA), enabling improved temporal and channel-wise feature discrimination. Experimental results show that the proposed method achieves a 0.42% EER and 0.01 min-tDCF on ASVspoof 2019 LA, a 1.01% EER on ASVspoof 2019 PA, and a pooled 6.56% EER on the cross-channel ASVspoof 2021 LA evaluation, thus highlighting its effectiveness for real-world deepfake detection scenarios. Furthermore, on the ASVspoof 5 dataset, the method achieved a 7.23% EER, outperforming strong baselines and demonstrating strong generalization ability. Moreover, a macro-averaged F1-score of 96.01% and balanced accuracy of 99.06% were obtained on the ASVspoof 2019 LA dataset, while the proposed method achieved a macro-averaged F1-score of 98.70% and balanced accuracy of 98.90% on the ASVspoof 2019 PA dataset. On the highly challenging ASVspoof 5 dataset, which includes crowdsourced, non-studio-quality audio and novel adversarial attacks, the proposed method achieves macro-averaged metrics exceeding 92%, with a precision of 92.07%, a recall of 92.63%, an F1-measure of 92.35%, and a balanced accuracy of 92.63%.
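
The EER figures quoted here are the operating point at which the false-acceptance and false-rejection rates coincide. A minimal, self-contained way to compute it from detector scores (toy Gaussian score distributions, not actual ASVspoof scores):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: threshold where false-acceptance and false-rejection rates cross.
    scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = spoof."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    far, frr = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far.append(np.mean(accept[labels == 0]))   # spoofed audio accepted
        frr.append(np.mean(~accept[labels == 1]))  # bona fide audio rejected
    far, frr = np.array(far), np.array(frr)
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000),    # toy bona fide scores
                         rng.normal(-2.0, 1.0, 1000)])  # toy spoof scores
labels = np.concatenate([np.ones(1000, int), np.zeros(1000, int)])
print(f"EER: {equal_error_rate(scores, labels):.2%}")
```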

24 pages, 4245 KB  
Article
Healthy Movement Leads to Emotional Connection: Development of the Movement Poomasi “Wello!” Application Based on Digital Psychosocial Touch—A Mixed-Methods Study
by Suyoung Hwang, Hyunmoon Kim and Eun-Surk Yi
Healthcare 2025, 13(17), 2157; https://doi.org/10.3390/healthcare13172157 - 29 Aug 2025
Viewed by 494
Abstract
Background/Objective: The global acceleration of population aging presents profound challenges to the physical, psychological, and social well-being of older adults. As traditional exercise programs face limitations in accessibility, personalization, and sustained social support, there is a critical need for innovative, inclusive, and community-integrated digital movement solutions. This study aimed to develop and evaluate Movement Poomasi, a hybrid digital healthcare application designed to promote physical activity, improve digital accessibility, and strengthen social connectedness among older adults. Methods: From March 2023 to November 2023, Movement Poomasi was developed through an iterative user-centered design process involving domain experts in physical therapy and sports psychology. In this study, the term UI/UX—short for user interface and user experience—refers to the overall design and interaction framework of the application, encompassing visual layout, navigation flow, accessibility features, and user engagement optimization tailored to older adults’ sensory, cognitive, and motor characteristics. The application integrates adaptive exercise modules, senior-optimized UI/UX, voice-assisted navigation, and peer-interaction features to enable both home-based and in-person movement engagement. A two-phase usability validation was conducted: a 4-week pilot test with 15 older adults assessed the prototype, followed by a formal 6-week study with 50 participants (≥65 years), stratified by digital literacy and activity background. Quantitative metrics—movement completion rates, session duration, and engagement with social features—were analyzed alongside semi-structured interviews. Statistical analysis included ANOVA and regression to examine usability and engagement outcomes. The application underwent continued iterative testing and refinement through May 2025 and is scheduled for re-launch under the name Wello! in August 2025. Results: Post-implementation UI refinements significantly increased navigation success rates (from 68% to 87%, p = 0.042). ANOVA revealed that movement selection and peer-interaction tasks posed greater cognitive load (p < 0.01). A strong positive correlation was found between digital literacy and task performance (r = 0.68, p < 0.05). Weekly participation increased by 38%, with 81% of participants reporting enhanced social connectedness through group challenges and hybrid peer-led meetups. Despite high satisfaction scores (mean 4.6 ± 0.4), usability challenges remained among low-literacy users, indicating the need for further interface simplification. Conclusions: The findings underscore the potential of hybrid digital platforms tailored to older adults’ physical, cognitive, and social needs. Movement Poomasi demonstrates scalable feasibility and contributes to reducing the digital divide while fostering active aging. Future directions include AI-assisted onboarding, adaptive tutorials, and expanded integration with community care ecosystems to enhance long-term engagement and inclusivity.
(This article belongs to the Special Issue Emerging Technologies for Person-Centred Healthcare)
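
The reported literacy-performance correlation is a standard Pearson test. A toy sketch with synthetic stand-in data (not the study's measurements; the coefficient used below only seeds the simulation):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the study's per-participant measurements (n = 50):
# digital-literacy scores and task-performance scores.
rng = np.random.default_rng(1)
literacy = rng.uniform(0, 10, 50)
performance = 0.68 * literacy + rng.normal(0, 2.5, 50)

r, p = stats.pearsonr(literacy, performance)
print(f"r = {r:.2f}, p = {p:.4f}")  # the study reports r = 0.68, p < 0.05
```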

21 pages, 1662 KB  
Article
Controllable Speech-Driven Gesture Generation with Selective Activation of Weakly Supervised Controls
by Karlo Crnek and Matej Rojc
Appl. Sci. 2025, 15(17), 9467; https://doi.org/10.3390/app15179467 - 28 Aug 2025
Viewed by 446
Abstract
Generating realistic and contextually appropriate gestures is crucial for creating engaging embodied conversational agents. Although speech is the primary input for gesture generation, adding controls like gesture velocity, hand height, and emotion is essential for generating more natural, human-like gestures. However, current approaches to controllable gesture generation often utilize a limited number of control parameters and lack the ability to activate/deactivate them selectively. Therefore, in this work, we propose the Cont-Gest model, a Transformer-based gesture generation model that enables selective control activation through masked training and a control fusion strategy. Furthermore, to better support the development of such models, we propose a novel evaluation-driven development (EDD) workflow, which combines several iterative tasks: automatic control signal extraction, control specification, visual (subjective) feedback, and objective evaluation. This workflow enables continuous monitoring of model performance and facilitates iterative refinement through feedback-driven development cycles. For objective evaluation, we use the validated Kinetic–Hellinger distance, a metric that correlates strongly with human perception of gesture quality. We evaluated multiple model configurations and control dynamics strategies within the proposed workflow. Experimental results show that Feature-wise Linear Modulation (FiLM) conditioning, combined with single-mask training and voice activity scaling, achieves the best balance between gesture quality and adherence to control inputs.
(This article belongs to the Section Computing and Artificial Intelligence)
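
FiLM conditioning, the configuration the authors found best, modulates intermediate features with a scale and shift predicted from the control vector; masking the controls before that prediction is one way to make them selectively activatable. A hedged PyTorch sketch of the general idea (the class, shapes, and control layout here are assumptions, not the Cont-Gest layer itself):

```python
import torch
import torch.nn as nn

class MaskedFiLM(nn.Module):
    """FiLM with selective control activation: each control (e.g., velocity,
    hand height, emotion) can be dropped at train or inference time via a
    binary mask. A generic sketch, not the paper's exact module."""
    def __init__(self, n_controls: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(n_controls, feat_dim)
        self.to_beta = nn.Linear(n_controls, feat_dim)

    def forward(self, features, controls, active_mask):
        # features: (B, T, feat_dim); controls, active_mask: (B, n_controls)
        c = controls * active_mask            # zeroed controls have no effect
        gamma = self.to_gamma(c).unsqueeze(1)  # (B, 1, feat_dim)
        beta = self.to_beta(c).unsqueeze(1)
        return (1 + gamma) * features + beta   # feature-wise scale and shift

film = MaskedFiLM(n_controls=3, feat_dim=64)
x = torch.randn(2, 100, 64)                        # gesture feature sequence
ctrl = torch.tensor([[0.8, 0.2, 0.5], [0.1, 0.9, 0.3]])
mask = torch.tensor([[1., 1., 0.], [1., 0., 0.]])  # deactivate some controls
print(film(x, ctrl, mask).shape)                   # torch.Size([2, 100, 64])
```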

18 pages, 1149 KB  
Article
Advanced Cryptography Using Nanoantennas in Wireless Communication
by Francisco Alves, João Paulo N. Torres, P. Mendonça dos Santos and Ricardo A. Marques Lameirinhas
Information 2025, 16(9), 720; https://doi.org/10.3390/info16090720 - 22 Aug 2025
Viewed by 438
Abstract
This work presents an end-to-end encryption–decryption framework for securing electromagnetic signals processed through a nanoantenna. The system integrates amplitude normalization, uniform quantization, and Reed–Solomon forward error correction with key establishment via ECDH and bitwise XOR encryption. Two signal types were evaluated: a synthetic Gaussian pulse and a synthetic voice waveform, representing low- and high-entropy data, respectively. For the Gaussian signal, reconstruction achieved an RMSE = 11.42, MAE = 0.86, PSNR = 26.97 dB, and Pearson’s correlation coefficient = 0.8887. The voice signal exhibited elevated error metrics, with an RMSE = 15.13, MAE = 2.52, PSNR = 24.54 dB, and Pearson correlation = 0.8062, yet maintained adequate fidelity. Entropy analysis indicated minimal changes between the original signal and the reconstructed signal. Furthermore, avalanche testing confirmed strong key sensitivity, with single-bit changes in the key altering approximately 50% of the ciphertext bits. The findings indicate that the proposed pipeline ensures high reconstruction quality with lightweight encryption, rendering it suitable for environments with limited computational resources.
(This article belongs to the Section Information and Communications Technology)
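
The avalanche property quoted above (a one-bit key change flipping about half the ciphertext bits) is easy to demonstrate with a keystream-style XOR cipher. The sketch below expands the key into a keystream by repeated hashing purely for illustration; the paper establishes its key via ECDH, which is omitted here, and the random bytes stand in for the quantized, Reed–Solomon-coded signal:

```python
import os
import hashlib

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """Bitwise XOR against a hash-expanded keystream (illustrative only)."""
    stream, block = b"", key
    while len(stream) < len(data):
        block = hashlib.sha256(block).digest()
        stream += block
    return bytes(d ^ s for d, s in zip(data, stream))

def bit_diff_ratio(a: bytes, b: bytes) -> float:
    """Fraction of bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b)) / (8 * len(a))

plaintext = os.urandom(4096)   # stands in for the quantized, RS-coded signal
key = os.urandom(32)
flipped = bytearray(key)
flipped[0] ^= 0x01             # flip a single key bit

c1 = xor_encrypt(plaintext, key)
c2 = xor_encrypt(plaintext, bytes(flipped))
print(f"ciphertext bits changed: {bit_diff_ratio(c1, c2):.1%}")  # around 50%
```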

42 pages, 1982 KB  
Article
SHAP-Based Identification of Potential Acoustic Biomarkers in Patients with Post-Thyroidectomy Voice Disorder
by Salih Celepli, Irem Bigat, Bilgi Karakas, Huseyin Mert Tezcan, Mehmet Dincay Yar, Pinar Celepli, Mehmet Feyzi Aksahin, Oguz Hancerliogullari, Yavuz Fuat Yilmaz and Osman Erogul
Diagnostics 2025, 15(16), 2065; https://doi.org/10.3390/diagnostics15162065 - 18 Aug 2025
Viewed by 631
Abstract
Objective: The objective of this study was to identify potential robust acoustic biomarkers for functional post-thyroidectomy voice disorder (PTVD) that may support early diagnosis and personalized treatment strategies, using acoustic analysis and explainable machine learning methods. Methods: Spectral and cepstral features were extracted from /a/ and /i/ voice recordings collected preoperatively and 4–6 weeks postoperatively from a total of 126 patients. Various Support Vector Machine (SVM) and Boosting models were trained. SHapley Additive exPlanations (SHAP) analysis was applied to enhance interpretability. SHAP values from training and test sets were compared via scatter plots to identify stable candidate biomarkers with high consistency. Results: GentleBoost (AUC = 0.85) and LogitBoost (AUC = 0.81) demonstrated the highest classification performance. Performance metrics across all models were evaluated for statistical significance. DeLong’s test was conducted to assess differences between ROC curves. The features iCPP, aCPP, and aHNR were identified as stable candidate biomarkers, exhibiting consistent SHAP distributions in both training and test sets in terms of direction and magnitude. These features showed statistically significant correlations with PTVD (p < 0.05) and demonstrated strong effect sizes (Cohen’s d = −2.95, −1.13, −0.60). Their diagnostic relevance was further supported by post hoc power analyses (iCPP: 1.00; aCPP: 0.998). Conclusions: SHAP-supported machine learning models offer an objective and clinically meaningful approach for evaluating PTVD. The identified features may serve as potential biomarkers to guide individualized voice therapy decisions during the early postoperative period.
(This article belongs to the Special Issue A New Era in Diagnosis: From Biomarkers to Artificial Intelligence)
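
The train/test SHAP-consistency check used to nominate stable biomarkers can be sketched with the `shap` library and a boosting classifier; the synthetic feature table below stands in for the study's spectral/cepstral measurements, and the simple mean-|SHAP| comparison is one plausible reading of "consistent distributions", not the authors' exact criterion:

```python
# Requires: pip install shap scikit-learn
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the spectral/cepstral feature table (126 patients in the study).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
explainer = shap.TreeExplainer(model)
shap_tr = explainer.shap_values(X_tr)
shap_te = explainer.shap_values(X_te)

# Stability check in the spirit of the paper: features whose mean |SHAP|
# agrees between training and test sets are candidate biomarkers.
stability_gap = np.abs(np.abs(shap_tr).mean(0) - np.abs(shap_te).mean(0))
print("most stable feature index:", int(stability_gap.argmin()))
```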

26 pages, 514 KB  
Article
Improving Voice Spoofing Detection Through Extensive Analysis of Multicepstral Feature Reduction
by Leonardo Mendes de Souza, Rodrigo Capobianco Guido, Rodrigo Colnago Contreras, Monique Simplicio Viana and Marcelo Adriano dos Santos Bongarti
Sensors 2025, 25(15), 4821; https://doi.org/10.3390/s25154821 - 5 Aug 2025
Viewed by 975
Abstract
Voice biometric systems play a critical role in numerous security applications, including electronic device authentication, banking transaction verification, and confidential communications. Despite their widespread utility, these systems are increasingly targeted by sophisticated spoofing attacks that leverage advanced artificial intelligence techniques to generate realistic synthetic speech. Addressing the vulnerabilities inherent to voice-based authentication systems has thus become both urgent and essential. This study proposes a novel experimental analysis that extensively explores various dimensionality reduction strategies in conjunction with supervised machine learning models to effectively identify spoofed voice signals. Our framework involves extracting multicepstral features followed by the application of diverse dimensionality reduction methods, such as Principal Component Analysis (PCA), Truncated Singular Value Decomposition (SVD), statistical feature selection (ANOVA F-value, Mutual Information), Recursive Feature Elimination (RFE), regularization-based LASSO selection, Random Forest feature importance, and Permutation Importance techniques. Empirical evaluation using the ASVSpoof 2017 v2.0 dataset measures the classification performance with the Equal Error Rate (EER) metric, achieving values of approximately 10%. Our comparative analysis demonstrates significant performance gains when dimensionality reduction methods are applied, underscoring their value in enhancing the security and effectiveness of voice biometric verification systems against emerging spoofing threats.
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)
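
One representative point in this search space, PCA-reduced multicepstral features feeding an SVM, looks roughly like the scikit-learn pipeline below (synthetic features, not ASVSpoof 2017 data; the component count is arbitrary). The resulting scores would then go to an EER routine such as the one sketched earlier in this list:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy multicepstral feature matrix standing in for features extracted from audio.
X, y = make_classification(n_samples=500, n_features=60, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=20),          # dimensionality reduction step
                    SVC(probability=True, random_state=0))
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]             # genuine-class scores for EER
print("mean genuine-class score:", scores[y_te == 1].mean().round(3))
```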

19 pages, 1039 KB  
Article
Prediction of Parkinson Disease Using Long-Term, Short-Term Acoustic Features Based on Machine Learning
by Mehdi Rashidi, Serena Arima, Andrea Claudio Stetco, Chiara Coppola, Debora Musarò, Marco Greco, Marina Damato, Filomena My, Angela Lupo, Marta Lorenzo, Antonio Danieli, Giuseppe Maruccio, Alberto Argentiero, Andrea Buccoliero, Marcello Dorian Donzella and Michele Maffia
Brain Sci. 2025, 15(7), 739; https://doi.org/10.3390/brainsci15070739 - 10 Jul 2025
Viewed by 939
Abstract
Background: Parkinson’s disease (PD) is the second most common neurodegenerative disorder after Alzheimer’s disease, affecting millions of individuals worldwide. PD is characterized by the onset of marked motor symptoms in association with several non-motor manifestations. The clinical phase of the disease is usually preceded by a long prodromal phase, devoid of overt motor symptoms but often marked by conditions such as sleep disturbance, constipation, anosmia, and phonatory changes. Speech analysis therefore appears to be a promising digital biomarker, potentially anticipating clinical PD by as much as 10 years, as well as serving as a useful prognostic tool for patient follow-up; voice is thus a strong candidate for non-invasively distinguishing PD patients from healthy subjects (HS). Methods: We conducted a cross-sectional study of voice impairment. A dataset comprising 81 voice samples (41 from healthy individuals and 40 from PD patients) was used to train and evaluate common machine learning (ML) models using various types of features, including long-term features (jitter, shimmer, and cepstral peak prominence (CPP)), short-term features (Mel-frequency cepstral coefficients (MFCCs)), and non-standard measurements (pitch period entropy (PPE) and recurrence period density entropy (RPDE)). The study adopted multiple ML algorithms, including random forest (RF), K-nearest neighbors (KNN), decision tree (DT), naïve Bayes (NB), support vector machines (SVM), and logistic regression (LR). Cross-validation was applied to ensure the reliability of performance metrics on the train and test subsets. These metrics (accuracy, recall, and precision) help determine the most effective models for distinguishing PD from healthy subjects. Results: Among all the algorithms used in this research, random forest (RF) was the best-performing model, achieving an accuracy of 82.72% and a ROC-AUC score of 89.65%. Although other models, such as the support vector machine (SVM), performed respectably, with an accuracy of 75.29% and a ROC-AUC score of 82.63%, RF was by far the best when evaluated across all metrics. The K-nearest neighbors (KNN) and decision tree (DT) models performed worst. Notably, by combining a comprehensive set of long-term, short-term, and non-standard acoustic features, unlike previous studies that typically focused on only a subset, our study achieved higher predictive performance, offering a more robust model for early PD detection. Conclusions: This study highlights the potential of combining advanced acoustic analysis with ML algorithms to develop non-invasive and reliable tools for early PD detection, offering substantial benefits for the healthcare sector.
(This article belongs to the Section Neurodegenerative Diseases)
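
The modeling recipe, an acoustic feature table in and a cross-validated random forest out, is compact in scikit-learn. A sketch with a random matrix shaped like the study's dataset (81 samples; the real columns would be jitter, shimmer, CPP, MFCCs, PPE, and RPDE, and the feature count here is an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy feature table: rows = voice samples, columns = acoustic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(81, 20))
y = np.concatenate([np.zeros(41, int), np.ones(40, int)])  # 41 HS, 40 PD

rf = RandomForestClassifier(n_estimators=300, random_state=0)
acc = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(f"accuracy: {acc.mean():.2%}  ROC-AUC: {auc.mean():.2%}")
```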

12 pages, 494 KB  
Article
Design of a Dual-Path Speech Enhancement Model
by Seorim Hwang, Sung Wook Park and Youngcheol Park
Appl. Sci. 2025, 15(11), 6358; https://doi.org/10.3390/app15116358 - 5 Jun 2025
Viewed by 1010
Abstract
Although both noise suppression and speech restoration are fundamental to speech enhancement, many deep neural network (DNN)-based approaches tend to focus disproportionately on one, often overlooking the importance of their joint handling. In this study, we propose a dual-path architecture designed to balance noise suppression and speech restoration. The main path consists of an encoder and two specialized decoders: one dedicated to estimating the clean speech spectrum and the other to predicting a noise suppression mask. To reinforce the joint modeling of noise suppression and speech restoration, we introduce an auxiliary refinement path. This path consists of a separate encoder–decoder structure and is designed to further refine the enhanced speech by incorporating complementary information, learned independently from the main path. By using this dual-path architecture, the model better preserves fine speech details while reducing residual noise. Experimental results on the VoiceBank + DEMAND dataset show that our model surpasses conventional methods across multiple evaluation metrics in the causal setup. Specifically, it achieves a PESQ score of 3.33, reflecting improved speech quality, and a CSIG score of 4.48, indicating enhanced intelligibility. Furthermore, it demonstrates superior noise suppression, achieving an SNRseg of 10.44 and a CBAK score of 3.75.
(This article belongs to the Special Issue Application of Deep Learning in Speech Enhancement Technology)
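
The main-path idea, one encoder feeding both a clean-spectrum decoder and a mask decoder, can be sketched in a few lines of PyTorch. This toy version omits the auxiliary refinement path and fuses the two estimates with a naive average; it illustrates the structure, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DualDecoderSE(nn.Module):
    """Toy one-encoder/two-decoder main path: one head estimates the clean
    magnitude spectrum, the other predicts a suppression mask."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
        self.spec_head = nn.Linear(hidden, n_freq)  # clean-spectrum decoder
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag):                   # noisy_mag: (B, T, n_freq)
        h, _ = self.encoder(noisy_mag)
        est_spec = self.spec_head(h)                # direct spectrum estimate
        mask = self.mask_head(h)                    # multiplicative mask
        # Naive fusion of the two estimates; real systems fuse more carefully.
        return 0.5 * est_spec + 0.5 * mask * noisy_mag

model = DualDecoderSE()
print(model(torch.randn(2, 100, 257)).shape)        # torch.Size([2, 100, 257])
```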

25 pages, 2106 KB  
Perspective
Digital Biomarkers and AI for Remote Monitoring of Fatigue Progression in Neurological Disorders: Bridging Mechanisms to Clinical Applications
by Thorsten Rudroff
Brain Sci. 2025, 15(5), 533; https://doi.org/10.3390/brainsci15050533 - 21 May 2025
Cited by 1 | Viewed by 2010
Abstract
Digital biomarkers for fatigue monitoring in neurological disorders represent an innovative approach to bridge the gap between mechanistic understanding and clinical application. This perspective paper examines how smartphone-derived measures, analyzed through artificial intelligence methods, can transform fatigue assessment from subjective, episodic reporting to continuous, objective monitoring. The proposed framework for smartphone-based digital phenotyping captures passive data (movement patterns, device interactions, and sleep metrics) and active assessments (ecological momentary assessments, cognitive tests, and voice analysis). These digital biomarkers can be validated through a multimodal approach connecting them to neuroimaging markers, clinical assessments, performance measures, and patient-reported experiences. Building on previous research on frontal–striatal metabolism in multiple sclerosis and Long-COVID-19 patients, digital biomarkers could enable early warning systems for fatigue episodes, objective treatment response monitoring, and personalized fatigue management strategies. Implementation considerations include privacy protection, equity concerns, and regulatory pathways. By integrating smartphone-derived digital biomarkers with AI analysis approaches, this perspective envisions a future in which fatigue in neurological disorders is no longer an invisible, subjective experience but rather a quantifiable, treatable phenomenon with established neural correlates and effective interventions. This transformative approach has significant potential to enhance both clinical care and research for the millions affected by disabling fatigue symptoms.
(This article belongs to the Section Neurotechnology and Neuroimaging)

27 pages, 15968 KB  
Article
MPFM-VC: A Voice Conversion Algorithm Based on Multi-Dimensional Perception Flow Matching
by Yanze Wang, Xuming Han, Shuai Lv, Ting Zhou and Yali Chu
Appl. Sci. 2025, 15(10), 5503; https://doi.org/10.3390/app15105503 - 14 May 2025
Viewed by 2572
Abstract
Voice conversion (VC) is an advanced technology that enables the transformation of raw speech into high-quality audio resembling the target speaker’s voice while preserving the original linguistic content and prosodic patterns. In this study, we propose a voice conversion algorithm, Multi-Dimensional Perception Flow Matching (MPFM-VC). Unlike traditional approaches that directly generate waveform outputs, MPFM-VC models the evolutionary trajectory of mel spectrograms with a flow-matching framework and incorporates a multi-dimensional feature perception network to enhance the stability and quality of speech synthesis. Additionally, we introduce a content perturbation method during training to improve the model’s generalization ability and reduce inference-time artifacts. To further increase speaker similarity, an adversarial training mechanism on speaker embeddings is employed to achieve effective disentanglement between content and speaker identity representations, thereby enhancing the timbre consistency of the converted speech. Experimental results for both speech and singing voice conversion tasks show that MPFM-VC achieves competitive performance compared to existing state-of-the-art VC models in both subjective and objective evaluation metrics. The synthesized speech shows improved naturalness, clarity, and timbre fidelity, suggesting the effectiveness of the proposed approach.
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)
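
Flow matching trains a network to predict the velocity of a noise-to-data path rather than the data directly. Below is a generic conditional flow-matching loss on mel spectrograms, using the straight-line path whose target velocity is constant; it is a sketch of the framework only, with made-up dimensions, while MPFM-VC adds multi-dimensional perception and adversarial speaker disentanglement on top:

```python
import torch
import torch.nn as nn

class ToyField(nn.Module):
    """Tiny vector-field network v(x_t, t, cond); a stand-in for the real model."""
    def __init__(self, n_mels=80, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, n_mels))

    def forward(self, xt, t, cond):
        t = t.expand(-1, xt.size(1), 1)              # broadcast time over frames
        return self.net(torch.cat([xt, cond, t], dim=-1))

def flow_matching_loss(field, x1, cond):
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # straight-line probability path
    return ((field(xt, t, cond) - (x1 - x0)) ** 2).mean()  # match constant velocity

field = ToyField()
mels = torch.randn(4, 120, 80)                       # target mel spectrograms
cond = torch.randn(4, 120, 16)                       # content + speaker conditioning
print(flow_matching_loss(field, mels, cond).item())
```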

19 pages, 904 KB  
Article
Enhancing Subband Speech Processing: Integrating Multi-View Attention Module into Inter-SubNet for Superior Speech Enhancement
by Jeih-Weih Hung, Tsung-Jung Li and Bo-Yu Su
Electronics 2025, 14(8), 1640; https://doi.org/10.3390/electronics14081640 - 18 Apr 2025
Viewed by 1348
Abstract
The Inter-SubNet speech enhancement network improves subband interaction by enabling the exchange of complementary information across frequency bands, ensuring robust feature refinement while significantly reducing computational load through lightweight, subband-specific modules. Despite its compact design, it outperforms state-of-the-art models such as FullSubNet, FullSubNet+, Conv-TasNet, and DCCRN+, offering a highly efficient yet powerful solution. To further enhance its performance, we propose integrating a Multi-view Attention (MA) module as a front-end or intermediate component. The MA module utilizes attention mechanisms across channel, global, and local views to emphasize critical features, ensuring comprehensive speech signal processing. Evaluations on the VoiceBank-DEMAND dataset show that incorporating the MA module significantly improves metrics like SI-SNR, PESQ, and STOI, demonstrating its effectiveness in enhancing subband feature extraction and overall speech enhancement performance.
(This article belongs to the Special Issue IoT Security in the Age of AI: Innovative Approaches and Technologies)
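
Of the three views such a module combines, the channel view is the most standard: squeeze the time-frequency map per channel, then re-weight channels through a small bottleneck. A generic squeeze-and-excitation-style PyTorch sketch (an assumption about the flavor of the channel view, not the paper's exact MA block):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-view attention: global-average-pool each channel over time and
    frequency, then learn per-channel gating weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, T, F) spectrogram features
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze time and frequency -> (B, C)
        return x * w[:, :, None, None]     # re-weight channels

att = ChannelAttention(channels=32)
print(att(torch.randn(2, 32, 100, 257)).shape)  # torch.Size([2, 32, 100, 257])
```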

10 pages, 1379 KB  
Proceeding Paper
Recognizing Human Emotions Through Body Posture Dynamics Using Deep Neural Networks
by Arunnehru Jawaharlalnehru, Thalapathiraj Sambandham and Dhanasekar Ravikumar
Eng. Proc. 2025, 87(1), 49; https://doi.org/10.3390/engproc2025087049 - 16 Apr 2025
Viewed by 1459
Abstract
Body posture dynamics have garnered significant attention in recent years due to their critical role in understanding the emotional states conveyed through human movements during social interactions. Emotions are typically expressed through facial expressions, voice, gait, posture, and overall body dynamics. Among these, body posture provides subtle yet essential cues about emotional states. However, predicting an individual’s gait and posture dynamics poses challenges, given the complexity of human body movement, which involves numerous degrees of freedom compared to facial expressions. Moreover, unlike static facial expressions, body dynamics are inherently fluid and continuously evolving. This paper presents an effective method for recognizing 17 micro-emotions by analyzing kinematic features from the GEMEP dataset using video-based motion capture. We specifically focus on upper-body posture dynamics (skeleton points and angles), capturing movement patterns and their dynamic range over time. Our approach addresses the complexity of recognizing emotions from posture and gait by focusing on key elements of kinematic gesture analysis. The experimental results demonstrate the effectiveness of the proposed deep neural network (DNN) model, achieving high accuracy rates of 91.48% with angle features and 93.89% with distance features on the GEMEP dataset. These findings highlight the potential of our model to advance posture-based emotion recognition, particularly in applications where the distance and angle dynamics of the human body are key indicators of emotional states.
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
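
The angle features come down to computing the angle at each joint from adjacent skeleton keypoints, frame by frame; stacked over time, these per-joint angles form the sequence fed to the DNN. A minimal sketch with made-up 2D keypoints:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by skeleton points a-b-c."""
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy upper-body keypoints (x, y): shoulder, elbow, wrist for one frame.
shoulder, elbow, wrist = (0.0, 1.0), (0.3, 0.6), (0.7, 0.7)
print(f"elbow angle: {joint_angle(shoulder, elbow, wrist):.1f} degrees")
```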

20 pages, 6941 KB  
Article
EmoSDS: Unified Emotionally Adaptive Spoken Dialogue System Using Self-Supervised Speech Representations
by Jaehwan Lee, Youngjun Sim, Jinyou Kim and Young-Joo Suh
Future Internet 2025, 17(4), 143; https://doi.org/10.3390/fi17040143 - 25 Mar 2025
Cited by 2 | Viewed by 1050
Abstract
In recent years, advancements in artificial intelligence, speech, and natural language processing technology have enhanced spoken dialogue systems (SDSs), enabling natural, voice-based human–computer interaction. However, discrete, token-based LLMs in emotionally adaptive SDSs focus on lexical content while overlooking essential paralinguistic cues for emotion expression. Existing methods use external emotion predictors to compensate for this but introduce computational overhead and fail to fully integrate paralinguistic features with linguistic context. Moreover, the lack of high-quality emotional speech datasets limits models’ ability to learn expressive emotional cues. To address these challenges, we propose EmoSDS, a unified SDS framework that integrates speech and emotion recognition by leveraging self-supervised learning (SSL) features. Our three-stage training pipeline enables the LLM to learn both discrete linguistic content and continuous paralinguistic features, improving emotional expressiveness and response naturalness. Additionally, we construct EmoSC, a dataset combining GPT-generated dialogues with emotional voice conversion data, ensuring greater emotional diversity and a balanced sample distribution across emotion categories. The experimental results show that EmoSDS outperforms existing models in emotional alignment and response generation, achieving a minimum 2.9% increase in text generation metrics, enhancing the LLM’s ability to interpret emotional and textual cues for more expressive and contextually appropriate responses.
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)
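
A common recipe for obtaining "discrete linguistic content plus continuous paralinguistic features" from SSL representations is to quantize frame-level features with k-means into unit tokens and keep the residual as the continuous stream. The sketch below shows that generic recipe on random vectors; it is an assumption about the flavor of pipeline involved, not EmoSDS's actual training code:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for frame-level SSL features (e.g., HuBERT-style outputs).
rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(500, 768))            # (frames, feature_dim)

km = KMeans(n_clusters=100, n_init=4, random_state=0).fit(ssl_feats)
units = km.predict(ssl_feats)                      # discrete linguistic tokens
residual = ssl_feats - km.cluster_centers_[units]  # continuous remainder,
                                                   # carrying paralinguistic detail
print(units[:10], residual.shape)
```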

16 pages, 25849 KB  
Article
A Hybrid Approach to Semantic Digital Speech: Enabling Gradual Transition in Practical Communication Systems
by Münif Zeybek, Bilge Kartal Çetin and Erkan Zeki Engin
Electronics 2025, 14(6), 1130; https://doi.org/10.3390/electronics14061130 - 13 Mar 2025
Cited by 1 | Viewed by 1307
Abstract
Recent advances in deep learning have fostered a transition from the traditional, bit-centric paradigm of Shannon’s information theory to a semantic-oriented approach, emphasizing the transmission of meaningful information rather than mere data fidelity. However, black-box AI-based semantic communication lacks structured discretization and remains dependent on analog modulation, which presents deployment challenges. This paper presents a new semantic-aware digital speech communication system, named Hybrid-DeepSCS, a stepping stone between traditional and fully end-to-end semantic communication. Our system comprises the following parts: a semantic encoder for extracting and compressing structured features, a standard transmitter for digital modulation including source and channel encoding, a standard receiver for recovering the bitstream, and a semantic decoder for expanding the features and reconstructing speech. By adding semantic encoding to a standard digital transmission, our system works with existing communication networks while exploring the potential of deep learning for feature representation and reconstruction. This hybrid method allows for gradual implementation, making it more practical for real-world uses like low-bandwidth speech, robust voice transmission over wireless networks, and AI-assisted speech on edge devices. The system’s compatibility with conventional digital infrastructure positions it as a viable solution for IoT deployments, where seamless integration with legacy systems and energy-efficient processing are critical. Furthermore, our approach addresses IoT-specific challenges such as bandwidth constraints in industrial sensor networks and latency-sensitive voice interactions in smart environments. We test the system under various channel conditions using Signal-to-Distortion Ratio (SDR), PESQ, and STOI metrics. The results show that our system delivers robust and clear speech, connecting traditional wireless systems with the future of AI-driven communication. The framework’s adaptability to edge computing architectures further underscores its relevance for IoT platforms, enabling efficient semantic processing in resource-constrained environments.
(This article belongs to the Special Issue Application of Artificial Intelligence in Wireless Communications)
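
Of the metrics used, SDR is the simplest to state: the energy ratio between the reference signal and the reconstruction error, in dB. A minimal implementation of the basic, scale-sensitive form on toy signals (PESQ and STOI require dedicated libraries and are omitted):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-Distortion Ratio in dB (simple, scale-sensitive form)."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                    # 1 s of toy "speech" at 16 kHz
received = clean + 0.05 * rng.normal(size=16000)  # channel distortion
print(f"SDR: {sdr(clean, received):.1f} dB")
```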
