Search Results (41)

Search Parameters:
Keywords = speech denoising

20 pages, 3941 KB  
Article
Self-Supervised Voice Denoising Network for Multi-Scenario Human–Robot Interaction
by Mu Li, Wenjin Xu, Chao Zeng and Ning Wang
Biomimetics 2025, 10(9), 603; https://doi.org/10.3390/biomimetics10090603 - 9 Sep 2025
Viewed by 418
Abstract
Human–robot interaction (HRI) via voice command has significantly advanced in recent years, with large Vision–Language–Action (VLA) models demonstrating particular promise in human–robot voice interaction. However, these systems still struggle with environmental noise contamination during voice interaction and lack a specialized denoising network for multi-speaker command isolation in overlapping speech scenarios. To overcome these challenges, we introduce a method to enhance voice command-based HRI in noisy environments, leveraging synthetic data and a self-supervised denoising network to improve real-world applicability. Our approach focuses on improving self-supervised network performance in denoising mixed-noise audio through training-data scaling. Extensive experiments show our method outperforms existing approaches in simulation and achieves 7.5% higher accuracy than the state-of-the-art method in noisy real-world environments, enhancing voice-guided robot control.
(This article belongs to the Special Issue Intelligent Human–Robot Interaction: 4th Edition)
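
A standard way to realize the training-data scaling described above is to synthesize noisy mixtures by adding recorded noise to clean speech at a prescribed signal-to-noise ratio. The sketch below is a minimal NumPy illustration of that idea, not the authors' pipeline; the function name, placeholder signals, and SNR grid are assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the mixture has the requested SNR (in dB)."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise to hit the target SNR: SNR = 10 * log10(Ps / Pn).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: build a small synthetic training set over several SNRs.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a 1 s speech clip at 16 kHz
noise = rng.standard_normal(48000)   # stand-in for a noise recording
pairs = [(mix_at_snr(clean, noise, snr), clean) for snr in (-5, 0, 5, 10)]
```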

15 pages, 1508 KB  
Article
Simultaneous Speech Denoising and Super-Resolution Using mGLFB-Based U-Net, Fine-Tuned via Perceptual Loss
by Hwai-Tsu Hu and Hao-Hsuan Tsai
Electronics 2025, 14(17), 3466; https://doi.org/10.3390/electronics14173466 - 29 Aug 2025
Viewed by 448
Abstract
This paper presents an efficient U-Net architecture featuring a modified Global Local Former Block (mGLFB) for simultaneous speech denoising and resolution reconstruction. Optimized for computational efficiency in the discrete cosine transform domain, the proposed architecture reduces model size by 13.5% compared to a standard GLFB-based U-Net while maintaining comparable performance across multiple quality metrics. In addition to the mGLFB redesign, we introduce a perceptual loss that better captures high-frequency magnitude spectra, yielding notable gains in high-resolution recovery, especially in unvoiced speech segments. However, the mGLFB-based U-Net still shows limitations in retrieving spectral details with substantial energy in the 4–6 kHz range.

19 pages, 1711 KB  
Article
TSDCA-BA: An Ultra-Lightweight Speech Enhancement Model for Real-Time Hearing Aids with Multi-Scale STFT Fusion
by Zujie Fan, Zikun Guo, Yanxing Lai and Jaesoo Kim
Appl. Sci. 2025, 15(15), 8183; https://doi.org/10.3390/app15158183 - 23 Jul 2025
Viewed by 1112
Abstract
Lightweight speech denoising models have made remarkable progress in improving both speech quality and computational efficiency. However, most models rely on long temporal windows as input, limiting their applicability in low-latency, real-time scenarios on edge devices. To address this challenge, we propose a lightweight hybrid module composed of Temporal Statistics Enhancement (TSE), Squeeze-and-Excitation-based Dual Convolutional Attention (SDCA), and Band-wise Attention (BA) components. The TSE module enhances single-frame spectral features by concatenating statistical descriptors—mean, standard deviation, maximum, and minimum—thereby capturing richer local information without relying on temporal context. The SDCA component integrates a simplified residual structure and channel attention, while the BA component further strengthens the representation of critical frequency bands through band-wise partitioning and differentiated weighting. The proposed model requires only 0.22 million multiply–accumulate operations (MMACs) and contains a total of 112.3 K parameters, making it well suited for low-latency, real-time speech enhancement applications. Experimental results demonstrate that among lightweight models with fewer than 200 K parameters, the proposed approach outperforms most existing methods in both denoising performance and computational efficiency, significantly reducing processing overhead. Furthermore, real-device deployment on an improved hearing aid confirms an inference latency as low as 2 milliseconds, validating its practical potential for real-time edge applications.
(This article belongs to the Section Computing and Artificial Intelligence)
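
As a rough illustration of the TSE idea, the four statistical descriptors (mean, standard deviation, maximum, minimum) can be concatenated to a single-frame spectrum as sketched below; the shapes and the function name are assumptions, not the published implementation.

```python
import numpy as np

def temporal_statistics_enhancement(frame_spectrum: np.ndarray) -> np.ndarray:
    """Append mean, std, max, and min descriptors to a single-frame spectrum.

    frame_spectrum: shape (num_bins,), magnitude spectrum of one STFT frame.
    Returns a vector of shape (num_bins + 4,).
    """
    stats = np.array([
        frame_spectrum.mean(),
        frame_spectrum.std(),
        frame_spectrum.max(),
        frame_spectrum.min(),
    ])
    return np.concatenate([frame_spectrum, stats])

frame = np.abs(np.fft.rfft(np.random.randn(512)))   # toy single-frame spectrum (257 bins)
enhanced = temporal_statistics_enhancement(frame)    # 257 bins + 4 statistics
```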

25 pages, 1964 KB  
Article
Hate Speech Detection and Online Public Opinion Regulation Using Support Vector Machine Algorithm: Application and Impact on Social Media
by Siyuan Li and Zhi Li
Information 2025, 16(5), 344; https://doi.org/10.3390/info16050344 - 24 Apr 2025
Viewed by 1147
Abstract
Detecting hate speech in social media is challenging due to its rarity, high-dimensional complexity, and implicit expression via sarcasm or spelling variations, rendering linear models ineffective. In this study, the SVM (Support Vector Machine) algorithm is used to map text features from a low-dimensional to a high-dimensional space using kernel function techniques to meet complex nonlinear classification challenges. By maximizing the margin between categories to locate the optimal hyperplane and combining kernel techniques to implicitly adjust the data distribution, the classification accuracy of hate speech detection is significantly improved. Data collection leverages social media APIs (Application Programming Interfaces) and customized crawlers with OAuth2.0 authentication and keyword filtering, ensuring relevance. Regular expressions validate data integrity, followed by preprocessing steps such as denoising, stop-word removal, and spelling correction. Word embeddings are generated using Word2Vec’s Skip-gram model, combined with TF-IDF (Term Frequency–Inverse Document Frequency) weighting to capture contextual semantics. A multi-level feature extraction framework integrates sentiment analysis via lexicon-based methods and BERT for advanced sentiment recognition. Experimental evaluations on two datasets demonstrate the SVM model’s effectiveness, achieving accuracies of 90.42% and 92.84%, recall rates of 88.06% and 90.79%, and average inference times of 3.71 ms and 2.96 ms. These results highlight the model’s ability to detect implicit hate speech accurately and efficiently, supporting real-time monitoring. This research contributes to creating a safer online environment by advancing hate speech detection methodologies.
(This article belongs to the Special Issue Information Technology in Society)
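
A minimal sketch of the embedding-plus-classifier stage is shown below, assuming gensim (Skip-gram Word2Vec) and scikit-learn (TF-IDF, RBF-kernel SVM); the toy corpus, labels, and hyperparameters are placeholders rather than the study's data or settings.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["you are awful", "have a nice day", "awful hateful words", "nice words"]
labels = [1, 0, 1, 0]                  # 1 = hateful, 0 = benign (toy labels)
tokens = [d.split() for d in docs]

# Skip-gram embeddings (sg=1) and TF-IDF weights over the same toy corpus.
w2v = Word2Vec(tokens, vector_size=50, sg=1, min_count=1, seed=0)
tfidf = TfidfVectorizer().fit(docs)
vocab_weight = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(words):
    """TF-IDF-weighted average of the word vectors in one document."""
    vecs = [w2v.wv[w] * vocab_weight.get(w, 1.0) for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.stack([doc_vector(t) for t in tokens])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)   # kernel trick: implicit high-dimensional mapping
print(clf.predict(X))
```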

21 pages, 6196 KB  
Article
Building a Gender-Bias-Resistant Super Corpus as a Deep Learning Baseline for Speech Emotion Recognition
by Babak Abbaschian and Adel Elmaghraby
Sensors 2025, 25(7), 1991; https://doi.org/10.3390/s25071991 - 22 Mar 2025
Viewed by 756
Abstract
The focus on Speech Emotion Recognition (SER) has dramatically increased in recent years, driven by the need for automatic speech-recognition-based systems and intelligent assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness concerning speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve comparable or superior results to the state of the art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus demonstrate superior generalization and accuracy and exhibit lower gender bias compared to models trained on individual databases. We further show that traditional preprocessing techniques, such as denoising and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.
(This article belongs to the Special Issue Emotion Recognition and Cognitive Behavior Analysis Based on Sensors)

13 pages, 2233 KB  
Article
High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism
by Junlin Deng, Ruihan Hou, Yan Deng, Yongqiu Long and Ning Wu
Sensors 2025, 25(3), 833; https://doi.org/10.3390/s25030833 - 30 Jan 2025
Viewed by 2469
Abstract
Denoising diffusion probabilistic models (DDPMs) have proven to be useful in text-to-speech (TTS) tasks; however, it has been a challenge for traditional diffusion models to carry out real-time processing because of the need for hundreds of sampling steps during iteration. In this work, a two-stage, fast-inference, and efficient diffusion-based acoustic model for TTS, the Cascaded MixGAN-TTS (CMG-TTS), is proposed to address this problem. An active shallow diffusion mechanism is adopted to divide the CMG-TTS training process into two stages. Specifically, a basic acoustic model in the first stage is trained to provide valuable a priori knowledge for the second stage, and for the underlying acoustic modeling, a mixture combination mechanism-based linguistic encoder is introduced to work with pitch and energy predictors. In the second stage, a post-net is used to optimize the mel-spectrogram reconstruction performance. The CMG-TTS is evaluated on the AISHELL3 and LJSpeech datasets, and the experiments show that the CMG-TTS achieves satisfactory results in both subjective and objective evaluation metrics with only one denoising step. Compared to other TTS models based on diffusion modeling, the CMG-TTS obtains a leading score in the real-time factor (RTF), and both stages of the CMG-TTS prove effective in the ablation studies.
(This article belongs to the Special Issue Sensors and Machine-Learning Based Signal Processing)

15 pages, 4088 KB  
Article
Options for Performing DNN-Based Causal Speech Denoising Using the U-Net Architecture
by Hwai-Tsu Hu and Tung-Tsun Lee
Appl. Syst. Innov. 2024, 7(6), 120; https://doi.org/10.3390/asi7060120 - 29 Nov 2024
Viewed by 1840
Abstract
Speech enhancement technology seeks to improve the quality and intelligibility of speech signals degraded by noise, particularly in telephone communications. Recent advancements have focused on leveraging deep neural networks (DNNs), especially U-Net architectures, for effective denoising. In this study, we evaluate the performance of a 6-level skip-connected U-Net constructed using either conventional convolution activation blocks (CCAB) or innovative global local former blocks (GLFB) across different processing domains: the temporal waveform, the short-time Fourier transform (STFT), and the short-time discrete cosine transform (STDCT). Our results indicate that the U-Nets achieve better signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ) scores when applied in the STFT and STDCT domains, with comparable short-time objective intelligibility (STOI) scores across all domains. Notably, the GLFB-based U-Net outperforms its CCAB counterpart in metrics such as CSIG, CBAK, COVL, and PESQ, while maintaining fewer learnable parameters. Furthermore, we propose domain-specific composite loss functions, considering the acoustic and perceptual characteristics of the spectral domain, to enhance the perceptual quality of denoised speech. Our findings provide valuable insights that can guide the optimization of DNN designs for causal speech denoising.
(This article belongs to the Section Information Systems)
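
For reference, a short-time discrete cosine transform (STDCT) representation of the kind such a U-Net could operate on can be sketched with SciPy as follows; the frame length, hop size, and window are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def stdct(signal, frame_len=512, hop=256):
    """Short-time DCT: frame the signal, window it, and apply DCT-II to each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return dct(frames, type=2, norm="ortho", axis=1)   # (n_frames, frame_len), real-valued

x = np.random.randn(16000)                             # toy 1 s waveform at 16 kHz
coeffs = stdct(x)                                       # real-valued time-frequency representation
frame0 = idct(coeffs[0], type=2, norm="ortho")          # recovers the windowed first frame
```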

14 pages, 1887 KB  
Article
Assessment of Self-Supervised Denoising Methods for Esophageal Speech Enhancement
by Madiha Amarjouf, El Hassan Ibn Elhaj, Mouhcine Chami, Kadria Ezzine and Joseph Di Martino
Appl. Sci. 2024, 14(15), 6682; https://doi.org/10.3390/app14156682 - 31 Jul 2024
Viewed by 1399
Abstract
Esophageal speech (ES) is a pathological voice that is often difficult to understand. Moreover, acquiring recordings of a patient’s voice before a laryngectomy proves challenging, thereby complicating the enhancement of this kind of voice. That is why most supervised methods used to enhance ES are based on voice conversion, which uses healthy speaker targets and may not preserve the speaker’s identity. Meanwhile, unsupervised methods for ES are mostly based on traditional filters, which alone cannot remove this kind of noise, making the denoising process difficult. These methods are also known for producing musical artifacts. To address these issues, a self-supervised method based on the Only-Noisy-Training (ONT) model was applied, which denoises a signal without needing a clean target. Four experiments were conducted using the Deep Complex UNET (DCUNET) and the Deep Complex UNET with Complex Two-Stage Transformer Module (DCUNET-cTSTM) for assessment. Both of these models are based on the ONT approach. For comparison purposes and to calculate the evaluation metrics, the pre-trained VoiceFixer model was used to restore the clean wave files of esophageal speech. Even though ONT-based methods work better with noisy wave files, the results prove that ES can be denoised without the need for clean targets, and hence the speaker’s identity is retained.

12 pages, 973 KB  
Article
Speech Emotion Recognition under Noisy Environments with SNR Down to −6 dB Using Multi-Decoder Wave-U-Net
by Hyun-Joon Nam and Hong-June Park
Appl. Sci. 2024, 14(12), 5227; https://doi.org/10.3390/app14125227 - 17 Jun 2024
Cited by 4 | Viewed by 2431
Abstract
A speech emotion recognition (SER) model for noisy environments is proposed that uses four band-pass filtered speech waveforms as the model input instead of simplified input features such as MFCCs (Mel Frequency Cepstral Coefficients). The four waveforms retain the entire information of the original noisy speech, while the simplified features keep only partial information of the noisy speech. The information reduction at the model input may cause accuracy degradation under noisy environments. A normalized loss function is used for training to maintain the high-frequency details of the original noisy speech waveform. A multi-decoder Wave-U-Net model is used to perform the denoising operation, and the Wave-U-Net output waveform is applied to an emotion classifier in this work. This reduces the number of parameters from 4.2 M used for training to 2.8 M for inference. The Wave-U-Net model consists of an encoder, a 2-layer LSTM, six decoders, and skip-nets; of the six decoders, four are used for denoising the four band-pass filtered waveforms, one is used for denoising the pitch-related waveform, and one is used to generate the emotion classifier input waveform. This work shows much less accuracy degradation than other SER works under noisy environments; compared to the accuracy for the clean speech waveform, the accuracy degradation is 3.8% at 0 dB SNR in this work, while it exceeds 15% in other SER works. The accuracy degradation of this work at SNRs of 0 dB, −3 dB, and −6 dB is 3.8%, 5.2%, and 7.2%, respectively.
(This article belongs to the Special Issue Advanced Technologies for Emotion Recognition)
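
A minimal sketch of splitting a waveform into four band-limited input waveforms with Butterworth band-pass filters (SciPy) is given below; the band edges and filter order are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_bank(x, fs=16000,
                  bands=((50, 500), (500, 1500), (1500, 3500), (3500, 7500))):
    """Split a waveform into band-limited waveforms, one per (low, high) band in Hz."""
    outputs = []
    for low, high in bands:
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        outputs.append(sosfiltfilt(sos, x))
    return np.stack(outputs)             # shape (4, len(x)): four full-length inputs

x = np.random.randn(16000)               # toy 1 s noisy utterance at 16 kHz
sub_bands = bandpass_bank(x)              # each band keeps the full waveform length
```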

13 pages, 2067 KB  
Article
A Dual-Branch Speech Enhancement Model with Harmonic Repair
by Lizhen Jia, Yanyan Xu and Dengfeng Ke
Appl. Sci. 2024, 14(4), 1645; https://doi.org/10.3390/app14041645 - 18 Feb 2024
Viewed by 2233
Abstract
Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Due to the lack of specific structures for harmonic fitting in previous studies and the limitations of the traditional convolutional receptive field, there is an inevitable decline in the auditory quality of the enhanced speech, leading to a decrease in the performance of subsequent tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, which uses a harmonic repair network for denoising, followed by a real–imaginary dual-branch structure for restoration. This approach fully utilizes the harmonic overtones to match the original harmonic distribution of speech. In the subsequent branch process, it restores the speech, specifically optimizing its auditory quality for the human ear. Experiments show that with HRLF-Net, the intelligibility and quality of speech are significantly improved, and harmonic information is effectively restored.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

12 pages, 882 KB  
Article
Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
by Chae-Woon Bang and Chanjun Chun
Sensors 2023, 23(23), 9591; https://doi.org/10.3390/s23239591 - 3 Dec 2023
Cited by 3 | Viewed by 3603
Abstract
Speech synthesis is a technology that converts text into speech waveforms. With the development of deep learning, neural network-based speech synthesis technology is being researched in various fields, and the quality of synthesized speech has significantly improved. In particular, Grad-TTS, a speech synthesis model based on the denoising diffusion probabilistic model (DDPM), exhibits high performance in various domains, generates high-quality speech, and supports multi-speaker speech synthesis. However, speech synthesis for unseen speakers is not possible. Therefore, this study proposes an effective zero-shot multi-speaker speech synthesis model that improves on the Grad-TTS structure. The proposed method obtains speaker information from reference speech using a pre-trained speaker recognition model. In addition, by converting speaker information via information perturbation, the model can learn various types of speaker information beyond those in the dataset. To evaluate the performance of the proposed method, we measured objective performance indicators, namely speaker encoder cosine similarity (SECS) and mean opinion score (MOS). To evaluate synthesis performance in both seen-speaker and unseen-speaker scenarios, Grad-TTS, SC-GlowTTS, and YourTTS were compared. The results demonstrated excellent speech synthesis performance for seen speakers and performance similar to that of existing zero-shot multi-speaker speech synthesis models.
(This article belongs to the Section Intelligent Sensors)
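
Speaker encoder cosine similarity (SECS) is simply the cosine between speaker embeddings of a reference utterance and a synthesized utterance; below is a minimal sketch, assuming the embeddings have already been extracted by a pre-trained speaker encoder (the embedding dimension and vectors are placeholders).

```python
import numpy as np

def secs(ref_embedding: np.ndarray, syn_embedding: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    num = float(np.dot(ref_embedding, syn_embedding))
    den = float(np.linalg.norm(ref_embedding) * np.linalg.norm(syn_embedding)) + 1e-12
    return num / den

ref = np.random.randn(256)    # embedding of the reference speaker (placeholder)
syn = np.random.randn(256)    # embedding of the synthesized speech (placeholder)
print(f"SECS = {secs(ref, syn):.3f}")
```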

33 pages, 8698 KB  
Article
Welding Penetration Monitoring for Ship Robotic GMAW Using Arc Sound Sensing Based on Improved Wavelet Denoising
by Ziquan Jiao, Tongshuai Yang, Xingyu Gao, Shanben Chen and Wenjing Liu
Machines 2023, 11(9), 911; https://doi.org/10.3390/machines11090911 - 16 Sep 2023
Cited by 5 | Viewed by 2042
Abstract
The arc sound signal is one of the most important sources of information for identifying the penetration state in ship robotic GMAW; however, arc sound is inevitably affected by noise interference during the signal acquisition process. In this paper, an improved wavelet threshold denoising method is proposed to eliminate interference and purify the arc sound signal. The non-stationary random distribution characteristics of GMAW noise interference are estimated using the high-frequency detail coefficients in different domains after wavelet transformation, and a measuring scale that is logarithmically and negatively correlated with the wavelet decomposition scale is introduced to update the threshold. The gradient convergent threshold function is established using the natural logarithmic function structure and a concave–convex gradient to enable nonlinear adjustment of the asymptotic rate. Further, some property theorems related to the optimized threshold function are proposed and theoretically proven, and the effectiveness and adaptability of the improved method are verified via denoising simulations of synthesized speech signals. Four traditional denoising methods and our improved version are then applied to the pretreatment of the GMAW arc sound signal. Statistical analysis and the short-time Fourier transform are used to extract eight-dimensional time and frequency domain feature parameters from the denoised signals with randomly time-varying characteristics, and the extracted joint feature parameters are used to establish a nonlinear mapping model of penetration state identification for ship robotic GMAW using the pattern classifiers RBFNN, PNN, and PSO-SVM. The results of visual penetration classification and the multi-dimensional evaluation indices of the confusion matrix indicate that the improved denoising method proposed in this paper achieves higher accuracy in the extraction of penetration state features and greater precision in pattern classification.
(This article belongs to the Special Issue Recent Applications in Non-destructive Testing (NDT))
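
For context, a baseline wavelet soft-threshold denoiser (PyWavelets) is sketched below; it uses the classic universal threshold rather than the improved scale-dependent threshold and gradient convergent threshold function proposed in the paper, so it should be read only as a reference point. The wavelet, decomposition level, and toy signal are assumptions.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db8", level=4):
    """Baseline wavelet soft-threshold denoising with the universal threshold."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Estimate the noise level from the finest detail coefficients (median rule).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(x)))               # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]

noisy = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4000)) + 0.3 * np.random.randn(4000)
clean_estimate = wavelet_denoise(noisy)
```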

21 pages, 3582 KB  
Article
Speech Enhancement Based on Enhanced Empirical Wavelet Transform and Teager Energy Operator
by Piotr Kuwałek and Waldemar Jęśko
Electronics 2023, 12(14), 3167; https://doi.org/10.3390/electronics12143167 - 21 Jul 2023
Cited by 2 | Viewed by 1518
Abstract
This paper presents a new speech-enhancement approach based on an enhanced empirical wavelet transform, considering the time and scale adaptation of thresholds for the individual component signals obtained from the transform. The time adaptation is performed using the Teager energy operator on the individual component signals, and the scale adaptation of thresholds is performed by a modified level-dependent threshold principle for the individual component signals. The proposed approach does not require an explicit estimation of the noise level or a priori knowledge of the signal-to-noise ratio, as is usually needed in most common speech-enhancement methods. The effectiveness of the proposed method has been assessed on over 1000 speech recordings from the public Librispeech database. The research included various types of noise (among others white, violet, brown, blue, and pink) and various types of disturbance (among others traffic sounds, a hair dryer, and a fan), which were added to the selected test signals. The perceptual evaluation of speech quality score, which assesses the quality of enhanced speech, and the signal-to-noise ratio, which assesses the effectiveness of disturbance attenuation, are selected to evaluate the resultant effectiveness of the proposed approach. The resultant effectiveness is compared with other selected speech-enhancement methods and denoising techniques available in the literature. The experimental results show that the proposed method performs better than conventional methods in many types of high-noise conditions, producing less residual noise and lower speech distortion.
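
The discrete Teager energy operator applied to each component signal is Psi[x(n)] = x(n)^2 − x(n−1)·x(n+1); a minimal NumPy sketch follows (the test signal and its sampling rate are placeholders).

```python
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

t = np.linspace(0, 1, 8000)
x = np.sin(2 * np.pi * 200 * t)        # toy component signal
energy = teager_energy(x)               # tracks the instantaneous amplitude-frequency energy
```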

14 pages, 6266 KB  
Article
Supervised Single Channel Speech Enhancement Method Using UNET
by Md. Nahid Hossain, Samiul Basir, Md. Shakhawat Hosen, A.O.M. Asaduzzaman, Md. Mojahidul Islam, Mohammad Alamgir Hossain and Md Shohidul Islam
Electronics 2023, 12(14), 3052; https://doi.org/10.3390/electronics12143052 - 12 Jul 2023
Cited by 9 | Viewed by 4230
Abstract
This paper proposes an innovative single-channel supervised speech enhancement (SE) method based on UNET, a convolutional neural network (CNN) architecture that extends the basic CNN architecture with a few modifications. In the training phase, the short-time Fourier transform (STFT) is applied to the noisy time-domain signal to build a noisy time-frequency domain signal, called the complex noisy matrix. We take the real and imaginary parts of the complex noisy matrix and concatenate them to form the noisy concatenated matrix. We apply UNET to the noisy concatenated matrix to extract speech components and train the CNN model. In the testing phase, the same procedure is applied to the noisy time-domain signal as in the training phase in order to construct another noisy concatenated matrix, which is then passed through the pre-trained (saved) model to construct an enhanced concatenated matrix. Finally, from the enhanced concatenated matrix, we separate the real and imaginary parts to form an enhanced complex matrix. Magnitude and phase are then extracted from the newly created enhanced complex matrix. Using that magnitude and phase, the inverse STFT (ISTFT) generates the enhanced speech signal. The proposed method is evaluated using the IEEE databases and various types of noise, including stationary and non-stationary noise. Comparing the exploratory results of the proposed algorithm to the other five methods of STFT, sparse non-negative matrix factorization (SNMF), dual-tree complex wavelet transform (DTCWT)-SNMF, DTCWT-STFT-SNMF, STFT-convolutional denoising autoencoder (CDAE), and causal multi-head attention mechanism (CMAM) for speech enhancement, we find that the proposed algorithm generally improves speech quality and intelligibility at all considered signal-to-noise ratios (SNRs). The suggested approach performs better than the other competing algorithms in every evaluation metric.
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
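
A minimal sketch of forming the noisy concatenated matrix from the real and imaginary STFT parts, and of inverting an enhanced concatenated matrix back to a waveform, is given below using scipy.signal; the frame parameters and the identity placeholder standing in for the UNET are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)                        # stand-in for a noisy utterance

# Input representation: concatenate real and imaginary STFT parts along frequency.
_, _, Z = stft(noisy, fs=fs, nperseg=512, noverlap=256)
noisy_concat = np.concatenate([Z.real, Z.imag], axis=0)   # the "noisy concatenated matrix"

# After the network predicts an enhanced concatenated matrix, split it back and invert.
enhanced_concat = noisy_concat                     # placeholder for the UNET output
half = enhanced_concat.shape[0] // 2
Z_enh = enhanced_concat[:half] + 1j * enhanced_concat[half:]
_, enhanced = istft(Z_enh, fs=fs, nperseg=512, noverlap=256)
```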

17 pages, 541 KB  
Article
Target Selection Strategies for Demucs-Based Speech Enhancement
by Caleb Rascon and Gibran Fuentes-Pineda
Appl. Sci. 2023, 13(13), 7820; https://doi.org/10.3390/app13137820 - 3 Jul 2023
Cited by 3 | Viewed by 4912
Abstract
The Demucs-Denoiser model has recently been shown to achieve a high level of performance for online speech enhancement, but it assumes that only one speech source is present in the input mixture. In real-life multiple-speech-source scenarios, it is not certain which speech source will be enhanced. To correct this issue, two target selection strategies for the Demucs-Denoiser model are proposed and evaluated: (1) an embedding-based strategy, using a codified sample of the target speech, and (2) a location-based strategy, using a beamforming-based prefilter to select the target that is in front of a two-microphone array. In this work, it is shown that while both strategies improve the performance of the Demucs-Denoiser model when one or more speech interferences are present, each has its pros and cons. Specifically, the beamforming-based strategy achieves an overall better performance (increasing the output SIR by between 5 and 10 dB) than the embedding-based strategy (which only increases the output SIR by 2 dB, and only in low-input-SIR scenarios). However, the beamforming-based strategy is sensitive to variation in the location of the target speech source (the output SIR decreases by 10 dB if the target speech source is located only 0.1 m from its expected position), a problem from which the embedding-based strategy does not suffer.
(This article belongs to the Special Issue Advances in Audio and Video Processing)
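
As a rough illustration of the location-based idea, a two-microphone delay-and-sum prefilter steered toward a source directly in front of the array is sketched below; the geometry, steering delay, and function name are assumptions, and the paper's actual beamformer may differ.

```python
import numpy as np

def delay_and_sum(mic_left, mic_right, delay_samples):
    """Steer a two-mic array by delaying one channel and averaging the pair.

    delay_samples > 0 delays the right channel (steers off-axis);
    delay_samples = 0 steers broadside, i.e. straight ahead of the array.
    """
    delayed = np.roll(mic_right, delay_samples)
    delayed[:delay_samples] = 0.0              # discard samples wrapped by the roll
    return 0.5 * (mic_left + delayed)

fs = 16000
mic_l = np.random.randn(fs)                    # placeholder two-channel capture
mic_r = np.random.randn(fs)
front_target = delay_and_sum(mic_l, mic_r, 0)  # target assumed in front of the array
```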
