Search Results (41)

Search Parameters:
Keywords = speech denoising

20 pages, 3941 KB  
Article
Self-Supervised Voice Denoising Network for Multi-Scenario Human–Robot Interaction
by Mu Li, Wenjin Xu, Chao Zeng and Ning Wang
Biomimetics 2025, 10(9), 603; https://doi.org/10.3390/biomimetics10090603 - 9 Sep 2025
Viewed by 418
Abstract
Human–robot interaction (HRI) via voice command has significantly advanced in recent years, with large Vision–Language–Action (VLA) models demonstrating particular promise in human–robot voice interaction. However, these systems still struggle with environmental noise contamination during voice interaction and lack a specialized denoising network for multi-speaker command isolation in overlapping speech scenarios. To overcome these challenges, we introduce a method to enhance voice command-based HRI in noisy environments, leveraging synthetic data and a self-supervised denoising network to improve real-world applicability. Our approach focuses on improving self-supervised network performance in denoising mixed-noise audio through training-data scaling. Extensive experiments show our method outperforms existing approaches in simulation and achieves 7.5% higher accuracy than the state-of-the-art method in noisy real-world environments, enhancing voice-guided robot control.
(This article belongs to the Special Issue Intelligent Human–Robot Interaction: 4th Edition)
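
A standard way to realize the training-data scaling described above is to synthesize noisy mixtures by adding recorded noise to clean speech at a prescribed signal-to-noise ratio. The sketch below is a minimal NumPy illustration of that idea, not the authors' pipeline; the function name, placeholder signals, and SNR grid are assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the mixture has the requested SNR (in dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise to hit the target SNR: SNR = 10 * log10(Ps / Pn).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: build a small synthetic training set over several SNRs.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a 1 s speech clip at 16 kHz
noise = rng.standard_normal(48000)   # stand-in for a noise recording
pairs = [(mix_at_snr(clean, noise, snr), clean) for snr in (-5, 0, 5, 10)]
```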

15 pages, 1508 KB  
Article
Simultaneous Speech Denoising and Super-Resolution Using mGLFB-Based U-Net, Fine-Tuned via Perceptual Loss
by Hwai-Tsu Hu and Hao-Hsuan Tsai
Electronics 2025, 14(17), 3466; https://doi.org/10.3390/electronics14173466 - 29 Aug 2025
Viewed by 448
Abstract
This paper presents an efficient U-Net architecture featuring a modified Global Local Former Block (mGLFB) for simultaneous speech denoising and resolution reconstruction. Optimized for computational efficiency in the discrete cosine transform domain, the proposed architecture reduces model size by 13.5% compared to a standard GLFB-based U-Net while maintaining comparable performance across multiple quality metrics. In addition to the mGLFB redesign, we introduce a perceptual loss that better captures high-frequency magnitude spectra, yielding notable gains in high-resolution recovery, especially in unvoiced speech segments. However, the mGLFB-based U-Net still shows limitations in retrieving spectral details with substantial energy in the 4–6 kHz range.

19 pages, 1711 KB  
Article
TSDCA-BA: An Ultra-Lightweight Speech Enhancement Model for Real-Time Hearing Aids with Multi-Scale STFT Fusion
by Zujie Fan, Zikun Guo, Yanxing Lai and Jaesoo Kim
Appl. Sci. 2025, 15(15), 8183; https://doi.org/10.3390/app15158183 - 23 Jul 2025
Viewed by 1112
Abstract
Lightweight speech denoising models have made remarkable progress in improving both speech quality and computational efficiency. However, most models rely on long temporal windows as input, limiting their applicability in low-latency, real-time scenarios on edge devices. To address this challenge, we propose a lightweight hybrid module composed of Temporal Statistics Enhancement (TSE), Squeeze-and-Excitation-based Dual Convolutional Attention (SDCA), and Band-wise Attention (BA) components. The TSE module enhances single-frame spectral features by concatenating statistical descriptors—mean, standard deviation, maximum, and minimum—thereby capturing richer local information without relying on temporal context. The SDCA component integrates a simplified residual structure and channel attention, while the BA component further strengthens the representation of critical frequency bands through band-wise partitioning and differentiated weighting. The proposed model requires only 0.22 million multiply–accumulate operations (MMACs) and contains a total of 112.3 K parameters, making it well suited for low-latency, real-time speech enhancement applications. Experimental results demonstrate that among lightweight models with fewer than 200 K parameters, the proposed approach outperforms most existing methods in both denoising performance and computational efficiency, significantly reducing processing overhead. Furthermore, real-device deployment on an improved hearing aid confirms an inference latency as low as 2 milliseconds, validating its practical potential for real-time edge applications.
(This article belongs to the Section Computing and Artificial Intelligence)
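
As a rough illustration of the TSE idea, the four statistical descriptors (mean, standard deviation, maximum, minimum) can be concatenated to a single-frame spectrum as sketched below; the shapes and the function name are assumptions, not the published implementation.

```python
import numpy as np

def temporal_statistics_enhancement(frame_spectrum: np.ndarray) -> np.ndarray:
    """Append mean, std, max, and min descriptors to a single-frame spectrum.

    frame_spectrum: shape (num_bins,), magnitude spectrum of one STFT frame.
    Returns a vector of shape (num_bins + 4,).
    """
    stats = np.array([
        frame_spectrum.mean(),
        frame_spectrum.std(),
        frame_spectrum.max(),
        frame_spectrum.min(),
    ])
    return np.concatenate([frame_spectrum, stats])

frame = np.abs(np.fft.rfft(np.random.randn(512)))   # toy single-frame spectrum (257 bins)
enhanced = temporal_statistics_enhancement(frame)    # 257 bins + 4 statistics
```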

25 pages, 1964 KB  
Article
Hate Speech Detection and Online Public Opinion Regulation Using Support Vector Machine Algorithm: Application and Impact on Social Media
by Siyuan Li and Zhi Li
Information 2025, 16(5), 344; https://doi.org/10.3390/info16050344 - 24 Apr 2025
Viewed by 1147
Abstract
Detecting hate speech in social media is challenging due to its rarity, high-dimensional complexity, and implicit expression via sarcasm or spelling variations, rendering linear models ineffective. In this study, the SVM (Support Vector Machine) algorithm is used to map text features from a low-dimensional to a high-dimensional space using kernel function techniques to meet complex nonlinear classification challenges. By maximizing the margin between categories to locate the optimal hyperplane and combining kernel techniques to implicitly adjust the data distribution, the classification accuracy of hate speech detection is significantly improved. Data collection leverages social media APIs (Application Programming Interfaces) and customized crawlers with OAuth2.0 authentication and keyword filtering, ensuring relevance. Regular expressions validate data integrity, followed by preprocessing steps such as denoising, stop-word removal, and spelling correction. Word embeddings are generated using Word2Vec’s Skip-gram model, combined with TF-IDF (Term Frequency–Inverse Document Frequency) weighting to capture contextual semantics. A multi-level feature extraction framework integrates sentiment analysis via lexicon-based methods and BERT for advanced sentiment recognition. Experimental evaluations on two datasets demonstrate the SVM model’s effectiveness, achieving accuracies of 90.42% and 92.84%, recall rates of 88.06% and 90.79%, and average inference times of 3.71 ms and 2.96 ms. These results highlight the model’s ability to detect implicit hate speech accurately and efficiently, supporting real-time monitoring. This research contributes to creating a safer online environment by advancing hate speech detection methodologies.
(This article belongs to the Special Issue Information Technology in Society)
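
A minimal sketch of the embedding-plus-classifier stage is shown below, assuming gensim (Skip-gram Word2Vec) and scikit-learn (TF-IDF, RBF-kernel SVM); the toy corpus, labels, and hyperparameters are placeholders rather than the study's data or settings.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["you are awful", "have a nice day", "awful hateful words", "nice words"]
labels = [1, 0, 1, 0]                  # 1 = hateful, 0 = benign (toy labels)
tokens = [d.split() for d in docs]

# Skip-gram embeddings (sg=1) and TF-IDF weights over the same toy corpus.
w2v = Word2Vec(tokens, vector_size=50, sg=1, min_count=1, seed=0)
tfidf = TfidfVectorizer().fit(docs)
vocab_weight = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(words):
    """TF-IDF-weighted average of the word vectors in one document."""
    vecs = [w2v.wv[w] * vocab_weight.get(w, 1.0) for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.stack([doc_vector(t) for t in tokens])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)   # kernel trick: implicit high-dimensional mapping
print(clf.predict(X))
```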

21 pages, 6196 KB  
Article
Building a Gender-Bias-Resistant Super Corpus as a Deep Learning Baseline for Speech Emotion Recognition
by Babak Abbaschian and Adel Elmaghraby
Sensors 2025, 25(7), 1991; https://doi.org/10.3390/s25071991 - 22 Mar 2025
Viewed by 756
Abstract
The focus on Speech Emotion Recognition (SER) has dramatically increased in recent years, driven by the need for automatic speech-recognition-based systems and intelligent assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness concerning speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve comparable or superior results to the state of the art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus demonstrate superior generalization and accuracy and exhibit lower gender bias compared to models trained on individual databases. We further show that traditional preprocessing techniques, such as denoising and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.
(This article belongs to the Special Issue Emotion Recognition and Cognitive Behavior Analysis Based on Sensors)

13 pages, 2233 KB  
Article
High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism
by Junlin Deng, Ruihan Hou, Yan Deng, Yongqiu Long and Ning Wu
Sensors 2025, 25(3), 833; https://doi.org/10.3390/s25030833 - 30 Jan 2025
Viewed by 2469
Abstract
Denoising diffusion probabilistic models (DDPMs) have proven to be useful in text-to-speech (TTS) tasks; however, it has been a challenge for traditional diffusion models to carry out real-time processing because of the need for hundreds of sampling steps during iteration. In this work, a two-stage, fast-inference, and efficient diffusion-based acoustic model for TTS, the Cascaded MixGAN-TTS (CMG-TTS), is proposed to address this problem. An active shallow diffusion mechanism is adopted to divide the CMG-TTS training process into two stages. Specifically, a basic acoustic model in the first stage is trained to provide valuable a priori knowledge for the second stage, and for the underlying acoustic modeling, a mixture combination mechanism-based linguistic encoder is introduced to work with pitch and energy predictors. In the second stage, a post-net is used to optimize the mel-spectrogram reconstruction performance. The CMG-TTS is evaluated on the AISHELL3 and LJSpeech datasets, and the experiments show that the CMG-TTS achieves satisfactory results in both subjective and objective evaluation metrics with only one denoising step. Compared to other TTS models based on diffusion modeling, the CMG-TTS obtains a leading score in the real-time factor (RTF), and both stages of the CMG-TTS prove effective in the ablation studies.
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)

15 pages, 4088 KB  
Article
Options for Performing DNN-Based Causal Speech Denoising Using the U-Net Architecture
by Hwai-Tsu Hu and Tung-Tsun Lee
Appl. Syst. Innov. 2024, 7(6), 120; https://doi.org/10.3390/asi7060120 - 29 Nov 2024
Viewed by 1840
Abstract
Speech enhancement technology seeks to improve the quality and intelligibility of speech signals degraded by noise, particularly in telephone communications. Recent advancements have focused on leveraging deep neural networks (DNNs), especially U-Net architectures, for effective denoising. In this study, we evaluate the performance of a 6-level skip-connected U-Net constructed using either conventional convolution activation blocks (CCAB) or innovative global local former blocks (GLFB) across different processing domains: the temporal waveform, the short-time Fourier transform (STFT), and the short-time discrete cosine transform (STDCT). Our results indicate that the U-Nets achieve better signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ) scores when applied in the STFT and STDCT domains, with comparable short-time objective intelligibility (STOI) scores across all domains. Notably, the GLFB-based U-Net outperforms its CCAB counterpart in metrics such as CSIG, CBAK, COVL, and PESQ, while maintaining fewer learnable parameters. Furthermore, we propose domain-specific composite loss functions, considering the acoustic and perceptual characteristics of the spectral domain, to enhance the perceptual quality of denoised speech. Our findings provide valuable insights that can guide the optimization of DNN designs for causal speech denoising.
(This article belongs to the Section Information Systems)
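
For reference, a short-time discrete cosine transform (STDCT) representation of the kind such a U-Net could operate on can be sketched with SciPy as follows; the frame length, hop size, and window are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def stdct(signal, frame_len=512, hop=256):
    """Short-time DCT: frame the signal, window it, and apply DCT-II to each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return dct(frames, type=2, norm="ortho", axis=1)   # (n_frames, frame_len), real-valued

x = np.random.randn(16000)                             # toy 1 s waveform at 16 kHz
coeffs = stdct(x)                                       # real-valued time-frequency representation
frame0 = idct(coeffs[0], type=2, norm="ortho")          # recovers the windowed first frame
```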

14 pages, 1887 KB  
Article
Assessment of Self-Supervised Denoising Methods for Esophageal Speech Enhancement
by Madiha Amarjouf, El Hassan Ibn Elhaj, Mouhcine Chami, Kadria Ezzine and Joseph Di Martino
Appl. Sci. 2024, 14(15), 6682; https://doi.org/10.3390/app14156682 - 31 Jul 2024
Viewed by 1399
Abstract
Esophageal speech (ES) is a pathological voice that is often difficult to understand. Moreover, acquiring recordings of a patient’s voice before a laryngectomy proves challenging, thereby complicating the enhancement of this kind of voice. That is why most supervised methods used to enhance ES are based on voice conversion, which uses healthy speaker targets and may not preserve the speaker’s identity. Meanwhile, unsupervised methods for ES are mostly based on traditional filters, which alone cannot remove this kind of noise, making the denoising process difficult. These methods are also known for producing musical artifacts. To address these issues, a self-supervised method based on the Only-Noisy-Training (ONT) model was applied, which denoises a signal without needing a clean target. Four experiments were conducted using the Deep Complex UNET (DCUNET) and the Deep Complex UNET with Complex Two-Stage Transformer Module (DCUNET-cTSTM) for assessment. Both of these models are based on the ONT approach. For comparison purposes and to calculate the evaluation metrics, the pre-trained VoiceFixer model was used to restore the clean wave files of esophageal speech. Even though ONT-based methods work better with noisy wave files, the results prove that ES can be denoised without the need for clean targets, and hence the speaker’s identity is retained.

12 pages, 973 KB  
Article
Speech Emotion Recognition under Noisy Environments with SNR Down to −6 dB Using Multi-Decoder Wave-U-Net
by Hyun-Joon Nam and Hong-June Park
Appl. Sci. 2024, 14(12), 5227; https://doi.org/10.3390/app14125227 - 17 Jun 2024
Cited by 4 | Viewed by 2431
Abstract
A speech emotion recognition (SER) model for noisy environments is proposed that uses four band-pass filtered speech waveforms as the model input instead of simplified input features such as MFCCs (Mel Frequency Cepstral Coefficients). The four waveforms retain the entire information of the original noisy speech, while the simplified features keep only partial information of the noisy speech. The information reduction at the model input may cause accuracy degradation under noisy environments. A normalized loss function is used for training to maintain the high-frequency details of the original noisy speech waveform. A multi-decoder Wave-U-Net model is used to perform the denoising operation, and the Wave-U-Net output waveform is applied to an emotion classifier in this work. This reduces the number of parameters from 4.2 M used for training to 2.8 M for inference. The Wave-U-Net model consists of an encoder, a 2-layer LSTM, six decoders, and skip-nets; of the six decoders, four are used for denoising the four band-pass filtered waveforms, one is used for denoising the pitch-related waveform, and one is used to generate the emotion classifier input waveform. This work shows much less accuracy degradation than other SER works under noisy environments; compared to the accuracy for the clean speech waveform, the accuracy degradation is 3.8% at 0 dB SNR in this work, while it exceeds 15% in other SER works. The accuracy degradation of this work at SNRs of 0 dB, −3 dB, and −6 dB is 3.8%, 5.2%, and 7.2%, respectively.
(This article belongs to the Special Issue Advanced Technologies for Emotion Recognition)
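
A minimal sketch of splitting a waveform into four band-limited input waveforms with Butterworth band-pass filters (SciPy) is given below; the band edges and filter order are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_bank(x, fs=16000,
                  bands=((50, 500), (500, 1500), (1500, 3500), (3500, 7500))):
    """Split a waveform into band-limited waveforms, one per (low, high) band in Hz."""
    outputs = []
    for low, high in bands:
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        outputs.append(sosfiltfilt(sos, x))
    return np.stack(outputs)             # shape (4, len(x)): four full-length inputs

x = np.random.randn(16000)               # toy 1 s noisy utterance at 16 kHz
sub_bands = bandpass_bank(x)              # each band keeps the full waveform length
```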

13 pages, 2067 KB  
Article
A Dual-Branch Speech Enhancement Model with Harmonic Repair
by Lizhen Jia, Yanyan Xu and Dengfeng Ke
Appl. Sci. 2024, 14(4), 1645; https://doi.org/10.3390/app14041645 - 18 Feb 2024
Viewed by 2233
Abstract
Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Due to the lack of specific structures for harmonic fitting in previous studies and the limitations of the traditional convolutional receptive field, there is an inevitable decline in the auditory quality of the enhanced speech, leading to a decrease in the performance of subsequent tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, which uses a harmonic repair network for denoising, followed by a real–imaginary dual-branch structure for restoration. This approach fully utilizes the harmonic overtones to match the original harmonic distribution of speech. In the subsequent branch process, it restores the speech, specifically optimizing its auditory quality for the human ear. Experiments show that with HRLF-Net, the intelligibility and quality of speech are significantly improved, and harmonic information is effectively restored.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

12 pages, 882 KB  
Article
Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
by Chae-Woon Bang and Chanjun Chun
Sensors 2023, 23(23), 9591; https://doi.org/10.3390/s23239591 - 3 Dec 2023
Cited by 3 | Viewed by 3603
Abstract
Speech synthesis is a technology that converts text into speech waveforms. With the development of deep learning, neural network-based speech synthesis technology is being researched in various fields, and the quality of synthesized speech has significantly improved. In particular, Grad-TTS, a speech synthesis model based on the denoising diffusion probabilistic model (DDPM), exhibits high performance in various domains, generates high-quality speech, and supports multi-speaker speech synthesis. However, speech synthesis for unseen speakers is not possible. Therefore, this study proposes an effective zero-shot multi-speaker speech synthesis model that improves on the Grad-TTS structure. The proposed method obtains speaker information from reference speech using a pre-trained speaker recognition model. In addition, by converting speaker information via information perturbation, the model can learn various types of speaker information beyond those in the dataset. To evaluate the performance of the proposed method, we measured objective performance indicators, namely speaker encoder cosine similarity (SECS) and mean opinion score (MOS). To evaluate synthesis performance in both seen-speaker and unseen-speaker scenarios, Grad-TTS, SC-GlowTTS, and YourTTS were compared. The results demonstrated excellent speech synthesis performance for seen speakers and performance similar to that of existing zero-shot multi-speaker speech synthesis models.
(This article belongs to the Section Intelligent Sensors)
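
Speaker encoder cosine similarity (SECS) is simply the cosine between speaker embeddings of a reference utterance and a synthesized utterance; below is a minimal sketch, assuming the embeddings have already been extracted by a pre-trained speaker encoder (the embedding dimension and vectors are placeholders).

```python
import numpy as np

def secs(ref_embedding: np.ndarray, syn_embedding: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    num = float(np.dot(ref_embedding, syn_embedding))
    den = float(np.linalg.norm(ref_embedding) * np.linalg.norm(syn_embedding)) + 1e-12
    return num / den

ref = np.random.randn(256)    # embedding of the reference speaker (placeholder)
syn = np.random.randn(256)    # embedding of the synthesized speech (placeholder)
print(f"SECS = {secs(ref, syn):.3f}")
```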

33 pages, 8698 KB  
Article
Welding Penetration Monitoring for Ship Robotic GMAW Using Arc Sound Sensing Based on Improved Wavelet Denoising
by Ziquan Jiao, Tongshuai Yang, Xingyu Gao, Shanben Chen and Wenjing Liu
Machines 2023, 11(9), 911; https://doi.org/10.3390/machines11090911 - 16 Sep 2023
Cited by 5 | Viewed by 2042
Abstract
The arc sound signal is one of the most important sources of information for identifying the penetration state in ship robotic GMAW; however, arc sound is inevitably affected by noise interference during the signal acquisition process. In this paper, an improved wavelet threshold denoising method is proposed to eliminate interference and purify the arc sound signal. The non-stationary random distribution characteristics of GMAW noise interference are estimated using the high-frequency detail coefficients in different domains after wavelet transformation, and a measuring scale that is logarithmically and negatively correlated with the wavelet decomposition scale is introduced to update the threshold. The gradient convergent threshold function is established using the natural logarithmic function structure and a concave–convex gradient to enable nonlinear adjustment of the asymptotic rate. Further, some property theorems related to the optimized threshold function are proposed and theoretically proven, and the effectiveness and adaptability of the improved method are verified via denoising simulations of synthesized speech signals. Four traditional denoising methods and our improved version are then applied to the pretreatment of the GMAW arc sound signal. Statistical analysis and the short-time Fourier transform are used to extract eight-dimensional time and frequency domain feature parameters from the denoised signals with randomly time-varying characteristics, and the extracted joint feature parameters are used to establish a nonlinear mapping model of penetration state identification for ship robotic GMAW using the pattern classifiers RBFNN, PNN, and PSO-SVM. The results of visual penetration classification and the multi-dimensional evaluation indices of the confusion matrix indicate that the improved denoising method proposed in this paper achieves higher accuracy in the extraction of penetration state features and greater precision in pattern classification.
(This article belongs to the Special Issue Recent Applications in Non-destructive Testing (NDT))
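
For context, a baseline wavelet soft-threshold denoiser (PyWavelets) is sketched below; it uses the classic universal threshold rather than the improved scale-dependent threshold and gradient convergent threshold function proposed in the paper, so it should be read only as a reference point. The wavelet, decomposition level, and toy signal are assumptions.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db8", level=4):
    """Baseline wavelet soft-threshold denoising with the universal threshold."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Estimate the noise level from the finest detail coefficients (median rule).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(x)))               # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]

noisy = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4000)) + 0.3 * np.random.randn(4000)
clean_estimate = wavelet_denoise(noisy)
```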

21 pages, 3582 KB  
Article
Speech Enhancement Based on Enhanced Empirical Wavelet Transform and Teager Energy Operator
by Piotr Kuwałek and Waldemar Jęśko
Electronics 2023, 12(14), 3167; https://doi.org/10.3390/electronics12143167 - 21 Jul 2023
Cited by 2 | Viewed by 1518
Abstract
This paper presents a new speech-enhancement approach based on an enhanced empirical wavelet transform, considering the time and scale adaptation of thresholds for the individual component signals obtained from the transform. The time adaptation is performed using the Teager energy operator on the individual component signals, and the scale adaptation of thresholds is performed by a modified level-dependent threshold principle for the individual component signals. The proposed approach does not require an explicit estimation of the noise level or a priori knowledge of the signal-to-noise ratio, as is usually needed in most common speech-enhancement methods. The effectiveness of the proposed method has been assessed on over 1000 speech recordings from the public Librispeech database. The research included various types of noise (among others white, violet, brown, blue, and pink) and various types of disturbance (among others traffic sounds, a hair dryer, and a fan), which were added to the selected test signals. The perceptual evaluation of speech quality score, which assesses the quality of enhanced speech, and the signal-to-noise ratio, which assesses the effectiveness of disturbance attenuation, are selected to evaluate the resultant effectiveness of the proposed approach. The resultant effectiveness is compared with other selected speech-enhancement methods and denoising techniques available in the literature. The experimental results show that the proposed method performs better than conventional methods in many types of high-noise conditions, producing less residual noise and lower speech distortion.
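
The discrete Teager energy operator applied to each component signal is Psi[x(n)] = x(n)^2 − x(n−1)·x(n+1); a minimal NumPy sketch follows (the test signal and its sampling rate are placeholders).

```python
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

t = np.linspace(0, 1, 8000)
x = np.sin(2 * np.pi * 200 * t)        # toy component signal
energy = teager_energy(x)               # tracks the instantaneous amplitude-frequency energy
```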

14 pages, 6266 KB  
Article
Supervised Single Channel Speech Enhancement Method Using UNET
by Md. Nahid Hossain, Samiul Basir, Md. Shakhawat Hosen, A.O.M. Asaduzzaman, Md. Mojahidul Islam, Mohammad Alamgir Hossain and Md Shohidul Islam
Electronics 2023, 12(14), 3052; https://doi.org/10.3390/electronics12143052 - 12 Jul 2023
Cited by 9 | Viewed by 4230
Abstract
This paper proposes an innovative single-channel supervised speech enhancement (SE) method based on UNET, a convolutional neural network (CNN) architecture that extends the basic CNN architecture with a few modifications. In the training phase, the short-time Fourier transform (STFT) is applied to the noisy time-domain signal to build a noisy time-frequency domain signal, called the complex noisy matrix. We take the real and imaginary parts of the complex noisy matrix and concatenate them to form the noisy concatenated matrix. We apply UNET to the noisy concatenated matrix to extract speech components and train the CNN model. In the testing phase, the same procedure is applied to the noisy time-domain signal as in the training phase in order to construct another noisy concatenated matrix, which is then passed through the pre-trained (saved) model to construct an enhanced concatenated matrix. Finally, from the enhanced concatenated matrix, we separate the real and imaginary parts to form an enhanced complex matrix. Magnitude and phase are then extracted from the newly created enhanced complex matrix. Using that magnitude and phase, the inverse STFT (ISTFT) generates the enhanced speech signal. The proposed method is evaluated using the IEEE databases and various types of noise, including stationary and non-stationary noise. Comparing the exploratory results of the proposed algorithm to the other five methods of STFT, sparse non-negative matrix factorization (SNMF), dual-tree complex wavelet transform (DTCWT)-SNMF, DTCWT-STFT-SNMF, STFT-convolutional denoising autoencoder (CDAE), and causal multi-head attention mechanism (CMAM) for speech enhancement, we find that the proposed algorithm generally improves speech quality and intelligibility at all considered signal-to-noise ratios (SNRs). The suggested approach performs better than the other competing algorithms in every evaluation metric.
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
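
A minimal sketch of forming the noisy concatenated matrix from the real and imaginary STFT parts, and of inverting an enhanced concatenated matrix back to a waveform, is given below using scipy.signal; the frame parameters and the identity placeholder standing in for the UNET are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)                        # stand-in for a noisy utterance

# Input representation: concatenate real and imaginary STFT parts along frequency.
_, _, Z = stft(noisy, fs=fs, nperseg=512, noverlap=256)
noisy_concat = np.concatenate([Z.real, Z.imag], axis=0)   # the "noisy concatenated matrix"

# After the network predicts an enhanced concatenated matrix, split it back and invert.
enhanced_concat = noisy_concat                     # placeholder for the UNET output
half = enhanced_concat.shape[0] // 2
Z_enh = enhanced_concat[:half] + 1j * enhanced_concat[half:]
_, enhanced = istft(Z_enh, fs=fs, nperseg=512, noverlap=256)
```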

17 pages, 541 KB  
Article
Target Selection Strategies for Demucs-Based Speech Enhancement
by Caleb Rascon and Gibran Fuentes-Pineda
Appl. Sci. 2023, 13(13), 7820; https://doi.org/10.3390/app13137820 - 3 Jul 2023
Cited by 3 | Viewed by 4912
Abstract
The Demucs-Denoiser model has recently been shown to achieve a high level of performance for online speech enhancement, but it assumes that only one speech source is present in the input mixture. In real-life multiple-speech-source scenarios, it is not certain which speech source will be enhanced. To correct this issue, two target selection strategies for the Demucs-Denoiser model are proposed and evaluated: (1) an embedding-based strategy, using a codified sample of the target speech, and (2) a location-based strategy, using a beamforming-based prefilter to select the target that is in front of a two-microphone array. In this work, it is shown that while both strategies improve the performance of the Demucs-Denoiser model when one or more speech interferences are present, each has its pros and cons. Specifically, the beamforming-based strategy achieves an overall better performance (increasing the output SIR by between 5 and 10 dB) than the embedding-based strategy (which only increases the output SIR by 2 dB, and only in low-input-SIR scenarios). However, the beamforming-based strategy is sensitive to variation in the location of the target speech source (the output SIR decreases by 10 dB if the target speech source is located only 0.1 m from its expected position), a problem from which the embedding-based strategy does not suffer.
(This article belongs to the Special Issue Advances in Audio and Video Processing)
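
As a rough illustration of the location-based idea, a two-microphone delay-and-sum prefilter steered toward a source directly in front of the array is sketched below; the geometry, steering delay, and function name are assumptions, and the paper's actual beamformer may differ.

```python
import numpy as np

def delay_and_sum(mic_left, mic_right, delay_samples):
    """Steer a two-mic array by delaying one channel and averaging the pair.

    delay_samples > 0 delays the right channel (steers off-axis);
    delay_samples = 0 steers broadside, i.e. straight ahead of the array.
    """
    delayed = np.roll(mic_right, delay_samples)
    delayed[:delay_samples] = 0.0              # discard samples wrapped by the roll
    return 0.5 * (mic_left + delayed)

fs = 16000
mic_l = np.random.randn(fs)                    # placeholder two-channel capture
mic_r = np.random.randn(fs)
front_target = delay_and_sum(mic_l, mic_r, 0)  # target assumed in front of the array
```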
