Article

Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform †

Department of Electrical Engineering, National Chi Nan University, Nantou 54561, Taiwan
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING), Taipei, Taiwan, 21–22 November 2022.
Appl. Sci. 2023, 13(10), 5992; https://doi.org/10.3390/app13105992
Submission received: 20 March 2023 / Revised: 4 May 2023 / Accepted: 10 May 2023 / Published: 12 May 2023

Abstract

Nowadays, time-domain features are widely used in speech enhancement (SE) networks, alongside frequency-domain features, to achieve excellent performance in removing noise from input utterances. This study primarily investigates how to extract information from time-domain utterances to create more effective features for SE. We extend our recent work by employing sub-band signals that reside in multiple acoustic frequency bands in the time domain and integrating them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the final features. The fusion strategy is either bi-projection fusion (BPF) or multiple projection fusion (MPF); in short, MPF replaces the sigmoid function in BPF with the softmax function so that ratio masks can be created for more than two feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of two well-known SE frameworks, the fully convolutional time-domain audio separation network (Conv-TasNet) and the dual-path transformer network (DPTNet), which estimate the mask and then produce the enhanced time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and the results reveal that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet, which uses time features only. This indicates that fusing DWT features derived from the input utterances helps the time features learn a superior Conv-TasNet/DPTNet network for SE.
Keywords: speech enhancement; discrete wavelet transform; cross-domain; Conv-TasNet; bi-projection fusion; multiple projection fusion
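To make the abstract's pipeline concrete, the following is a minimal NumPy sketch of its two core ingredients: a one-level Haar DWT that splits a time-domain frame into low- and high-frequency sub-band signals, and a softmax-based fusion in the spirit of MPF, where each source receives an element-wise ratio mask and the masked sources are summed. The fixed random projection matrices here are hypothetical stand-ins for the learned projections in the actual model; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_dwt(frame):
    """One-level Haar DWT: split a frame into low- and high-band sub-signals."""
    even, odd = frame[0::2], frame[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency sub-band
    detail = (even - odd) / np.sqrt(2)   # high-frequency sub-band
    return approx, detail

def mpf_fuse(features, proj_mats):
    """Softmax-style projection fusion over N feature sources (MPF sketch).

    features:  list of N arrays, each of shape (D,)
    proj_mats: list of N (D, D) projection matrices (learned in the real model)
    """
    logits = np.stack([P @ f for P, f in zip(proj_mats, features)])      # (N, D)
    masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)   # softmax over sources
    return (masks * np.stack(features)).sum(axis=0)                      # (D,) fused feature

frame = rng.standard_normal(64)               # one time-domain frame
lo, hi = haar_dwt(frame)                      # two 32-dim sub-band signals
P = [rng.standard_normal((32, 32)) * 0.1 for _ in range(2)]
fused = mpf_fuse([lo, hi], P)
print(fused.shape)                            # (32,)
```

Because the masks are a softmax across sources, they sum to one element-wise, so the fused vector is a convex combination of the sub-band features at every dimension; with sigmoid gates instead of softmax, the same scaffold reduces to the two-source BPF case.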

Share and Cite

MDPI and ACS Style

Chen, Y.-T.; Wu, Z.-T.; Hung, J.-W. Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform. Appl. Sci. 2023, 13, 5992. https://doi.org/10.3390/app13105992
