Effective Monoaural Speech Separation through Convolutional Top-Down Multi-View Network
Abstract
1. Introduction
- Deep clustering-based methods [2,3] learn embeddings from mixed audio signals that can be used to distinguish between different speakers. In [2], the spectrogram embeddings are learned through a DNN to be discriminative for clustering, and the work in [3] extends the deep clustering framework by end-to-end training with better regularization, longer temporal context, and a deeper structure.
- Time-Domain Audio Separation Network (TasNet) [4,5] circumvents the drawbacks of time-frequency representations by operating directly in the time domain to separate speech signals. TasNet employs a 1D convolutional encoder–decoder system whose fundamental component is the separation module, which estimates source masks from the encoder outputs. These masks are applied to the encoded mixture by element-wise multiplication in order to separate the different sources.
- The rise of self-attention mechanisms and the transformer architecture [6] has greatly improved the ability to capture long-range contextual information, leading to even better speech separation quality. Sepformer [7] and DPTNet [8] are two examples of speech separation systems that adopt transformers. They are typically implemented in a dual-path architecture, which introduces a large number of parameters because these systems model intra-chunk local features and inter-chunk global relationships separately.
- We propose ESC-MASD-Net, a speech separation framework that achieves state-of-the-art performance while maintaining low computation load and compact model size.
- ESC-MASD-Net is built on the SuDoRM-RF++ structure and accounts for the channel, global, and local information of the input feature stream by employing a multi-view attention (MA) block. Additionally, the framework uses a residual conformer (ResCon) block that flexibly adjusts the number of input channels, while its residual connection keeps the input information from being lost. Evaluation experiments indicate that adding the MA and ResCon blocks improves SI-SDRi over SuDoRM-RF++.
- We investigate the conformer layer structure presented in [15] and propose three different arrangements for adopting the conformer layer in the U-Convolution blocks of ESC-MASD-Net. The evaluation results show that ESC-MASD-Net equipped with the conformer layer achieves superior performance; thus, the conformer layer is a worthwhile addition to ESC-MASD-Net.
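The mask-based separation used by TasNet-style systems, as described in the overview above, can be illustrated with a toy sketch. It only shows the masking step (element-wise multiplication of each estimated mask with the encoded mixture); the function name `apply_masks` and all numbers are hypothetical, and a real system learns the encoder, the mask estimator, and the decoder.

```python
# Toy sketch of mask-based separation in a learned (encoder) domain.
# All values are made up for illustration; nothing here is taken from the paper.

def apply_masks(encoded_mixture, masks):
    """Multiply each source mask element-wise with the encoded mixture.

    encoded_mixture: list of frames, each a list of basis activations.
    masks: one mask per source, same shape as encoded_mixture.
    Returns one masked representation per source.
    """
    separated = []
    for mask in masks:
        source = [
            [m * w for m, w in zip(mask_frame, mix_frame)]
            for mask_frame, mix_frame in zip(mask, encoded_mixture)
        ]
        separated.append(source)
    return separated


# Two frames with three basis activations each (hypothetical numbers).
encoded_mixture = [[1.0, 2.0, 4.0], [0.5, 0.0, 3.0]]
mask_a = [[1.0, 0.5, 0.25], [0.0, 1.0, 0.5]]
# Complementary mask, so the two masked sources sum back to the mixture.
mask_b = [[1.0 - m for m in frame] for frame in mask_a]

source_a, source_b = apply_masks(encoded_mixture, [mask_a, mask_b])
```

In a full system, each masked representation would then be passed through the decoder to reconstruct a time-domain waveform for the corresponding speaker.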
2. SuDoRM-RF++
2.1. Encoder
2.2. Separator
2.3. Decoder
3. Presented Method: ESC-MASD-Net
3.1. Residual Conformer Block
- The main branch of the ResCon block expands channels for information capture and enhancement. Furthermore, this branch uses depthwise separable convolution, which combines depth-wise and point-wise convolution to drastically reduce the overall computation compared to a standard convolution.
- The ResCon block uses a GLU (Gated Linear Unit) which operates as follows:
- The Swish function is represented as $f(x) = x \cdot \sigma(\beta x)$, where $\beta$ is a learnable parameter and $\sigma$ is the sigmoid function. Swish has been reported to outperform ReLU activation for deep neural network training. Unlike ReLU, whose gradient is zero for negative inputs, Swish still passes a gradient there, so learning does not stall when the input is negative.
- The ResCon block’s residual connection branch preserves all input information, so the main branch only needs to learn information that will improve or sustain performance. Furthermore, the point-wise convolution applied to the residual connection helps alter the number of input channels with little computational effort.
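The two activations named in the list above can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions, assuming the GLU splits its input into a content half and a gate half and that Swish is x·sigmoid(βx) with β fixed rather than learned; it is not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). beta is learnable in practice, fixed here."""
    return x * sigmoid(beta * x)


def glu(values):
    """Gated Linear Unit on a flat list (standard definition, assumed here).

    The first half of the input is the content, the second half is the gate;
    the output has half the input length.
    """
    if len(values) % 2 != 0:
        raise ValueError("GLU input length must be even")
    half = len(values) // 2
    content, gate = values[:half], values[half:]
    return [c * sigmoid(g) for c, g in zip(content, gate)]


# A gate value of 0 gives sigmoid(0) = 0.5, so the content is halved.
gated = glu([2.0, -3.0, 0.0, 0.0])

# For a negative input ReLU outputs exactly zero, while Swish stays non-zero.
negative_relu = relu(-1.0)
negative_swish = swish(-1.0)
```

The non-zero Swish output for negative inputs is what allows a gradient to keep flowing where ReLU would block it.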
3.2. Multi-View Attention Block
- Path Separation: The input travels through three routes in the MA block, each of which has a convolution layer that changes the channel size from C to . We use chunking with a overlap ratio for both global and local attention paths, splitting into , where P and S stand for chunk size and number of chunks, respectively. This way, long sequential features are efficiently presented due to separating the global and local information. The three perspectives of attention are described as follows.
- Channel Attention: Given the input for the channel attention path, the average and max pooling processes are used to aggregate the signal information for each channel . The pooling outputs, and , are then passed through a common densely connected network with one hidden layer having nodes, followed by the sigmoid activation function to obtain the channel attention weight :
- Global Attention: The global attention is based on the self-attention of Transformer, in which the chunk-wise representation for the global attention input is taken into account. The corresponding output is determined by multi-head self-attention (MHA):
- Local Attention: The local sequential features in each chunk are represented by local attention. On the chunked input , a depthwise convolution layer with a kernel size of is applied. Following the depthwise convolution layer, we concatenate the channel-wise average and max pooling to estimate the local attention weight as follows:
- Path Aggregation: Following the three attention paths, each output is concatenated and then passed through a convolution layer. A residual gate with sigmoid activation, hyperbolic tangent activation, and ReLU is used to process the resulting output further. Finally, a residual connection is made.
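Two of the steps above, chunking with overlap and the channel attention weights, can be sketched in plain Python. Several details are assumptions because the corresponding values and equations are not available in this text: a 50% overlap with zero padding, a ReLU in the hidden layer, and summing the two shared-network outputs before the sigmoid (the common CBAM-style formulation). The function names and weight matrices are hypothetical.

```python
import math


def chunk_sequence(seq, chunk_size, overlap=0.5):
    """Split a sequence into overlapping chunks of length chunk_size (P).

    Assumes a 50% overlap by default and zero-pads the tail so that every
    chunk is full. Returns a list of S chunks.
    """
    hop = max(1, int(chunk_size * (1.0 - overlap)))
    n = len(seq)
    num_chunks = 1 if n <= chunk_size else 1 + math.ceil((n - chunk_size) / hop)
    padded_len = chunk_size + (num_chunks - 1) * hop
    padded = list(seq) + [0.0] * (padded_len - n)
    return [padded[i * hop:i * hop + chunk_size] for i in range(num_chunks)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """One hidden layer (ReLU assumed) shared by both pooled vectors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention_weights(x, w1, w2):
    """x is a list of C channels, each a list of T samples.

    Average- and max-pool each channel over time, pass both pooled vectors
    through the shared network, sum the results (assumption), apply sigmoid.
    """
    avg_pool = [sum(ch) / len(ch) for ch in x]
    max_pool = [max(ch) for ch in x]
    a = shared_mlp(avg_pool, w1, w2)
    m = shared_mlp(max_pool, w1, w2)
    return [sigmoid(p + q) for p, q in zip(a, m)]


chunks = chunk_sequence([0, 1, 2, 3, 4, 5, 6], chunk_size=4)

# C = 2 channels, T = 2 samples, hidden size 1 (hypothetical weights).
features = [[1.0, 3.0], [0.0, 0.0]]
w1 = [[1.0, 0.0]]      # hidden x C
w2 = [[1.0], [0.0]]    # C x hidden
weights = channel_attention_weights(features, w1, w2)
```

Each resulting weight lies between 0 and 1 and would be used to rescale its channel; the global and local paths operate on the chunked representation instead.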
4. U-Convolutional Blocks Enhanced with Conformer Layer
- (a) Right before the U-Convolutional blocks;
- (b) At the bottom (right after the most downsampled layer) of the first U-Convolutional block;
- (c) At the bottom of all U-Convolutional blocks.
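The three arrangements above differ only in where the conformer layers sit and how many are needed. The sketch below makes that explicit by listing a layer plan for each option; the function names and layer labels are hypothetical, the plan omits every other component of the network, and the default of four U-Convolutional blocks follows the count mentioned in the results discussion.

```python
def build_layer_plan(placement, num_uconv_blocks=4):
    """Return an ordered list of layer labels for one placement option.

    placement: "none", "a" (before the U-Conv blocks), "b" (bottom of the
    first U-Conv block) or "c" (bottom of every U-Conv block).
    """
    if placement not in ("none", "a", "b", "c"):
        raise ValueError(f"unknown placement: {placement!r}")
    plan = []
    if placement == "a":
        plan.append("conformer")
    for i in range(1, num_uconv_blocks + 1):
        has_conformer = placement == "c" or (placement == "b" and i == 1)
        suffix = "+bottom_conformer" if has_conformer else ""
        plan.append(f"uconv_{i}{suffix}")
    return plan


def count_conformer_layers(plan):
    """Count how many entries of the plan contain a conformer layer."""
    return sum("conformer" in name for name in plan)


plans = {p: build_layer_plan(p) for p in ("none", "a", "b", "c")}
counts = {p: count_conformer_layers(plan) for p, plan in plans.items()}
```

Options (a) and (b) each add a single conformer layer, while option (c) adds one per U-Convolutional block.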
5. Experimental Setup
5.1. Data
5.2. Data Generation and Augmentation
5.3. Separation Network Configurations
5.4. Training Objectives
5.5. Evaluation Details
5.6. Programming
6. Experimental Results and Discussion
- The presented ESC-MASD-Net outperforms SuDoRM-RF++ by 1.33 dB in SI-SDRi (13.71 vs. 12.38), demonstrating the effectiveness of ESC-MASD-Net in speech separation.
- In the ablation study, the SI-SDRi decreases by 0.11 dB when the MA block is removed from ESC-MASD-Net, whereas it decreases by 0.54 dB when the ResCon block is removed from ESC-MASD-Net. These findings suggest that ESC-MASD-Net may benefit more from ResCon than MA in speech separation.
- ESC-MASD-Net with MA alone and ESC-MASD-Net with ResCon alone both perform better than SuDoRM-RF++. As a result, MA and ResCon both have the potential to improve SuDoRM-RF++’s separation performance.
- The inclusion of the conformer layer in ESC-MASD-Net consistently increased the SI-SDRi score, regardless of where it was placed in the network. This demonstrates the utility of the conformer layer.
- We obtained the best overall performance (SI-SDRi score of 13.95) when a single conformer layer was inserted directly before the first U-Convolutional block. Adding a conformer layer to the first U-Convolutional block (at the bottom, most downsampled portion) yielded an SI-SDRi score of 13.80. Adding a conformer layer to each of the four U-Convolutional blocks, on the other hand, resulted in an SI-SDRi score of 13.90, which was a 0.10 improvement over the previous arrangement but required three additional conformer layers.
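The differences discussed in this list follow directly from the SI-SDRi values in the result tables (at 100 epochs). The short calculation below reproduces them; the dictionary keys are shorthand labels introduced here for illustration.

```python
# SI-SDRi values copied from the result tables (100 training epochs).
si_sdri = {
    "SuDoRM-RF++": 12.38,
    "ESC-MASD-Net": 13.71,
    "without MA": 13.60,
    "without ResCon": 13.17,
    "conformer (a)": 13.95,
    "conformer (b)": 13.80,
    "conformer (c)": 13.90,
}


def gain(model, baseline):
    """SI-SDRi difference between two table entries, rounded to 2 decimals."""
    return round(si_sdri[model] - si_sdri[baseline], 2)


gain_over_baseline = gain("ESC-MASD-Net", "SuDoRM-RF++")
drop_without_ma = gain("ESC-MASD-Net", "without MA")
drop_without_rescon = gain("ESC-MASD-Net", "without ResCon")
gain_c_over_b = gain("conformer (c)", "conformer (b)")
best_placement = max(
    ("conformer (a)", "conformer (b)", "conformer (c)"), key=si_sdri.get
)
```

Removing ResCon costs more SI-SDRi than removing MA, and arrangement (a) scores highest while using a single conformer layer.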
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858.
- Hershey, J.R.; Chen, Z.; Roux, J.L.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. arXiv 2016, arXiv:1508.04306.
- Isik, Y.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-channel multi-speaker separation using deep clustering. arXiv 2016, arXiv:1607.0217.
- Luo, Y.; Mesgarani, N. Tasnet: Time-domain audio separation network for real-time, single-channel speech separation. arXiv 2018, arXiv:1711.00541.
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266. Available online: https://dl.acm.org/doi/10.1109/TASLP.2019.2915167 (accessed on 30 March 2024).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. arXiv 2021, arXiv:2010.13154.
- Chen, J.; Mao, Q.; Liu, D. Dual-path Transformer network: Direct context-aware modeling for end-to-end monaural speech separation. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; Available online: http://www.interspeech2020.org/uploadfile/pdf/Wed-2-4-6.pdf (accessed on 30 March 2024).
- Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path rnn: Efficient long sequence modeling for time-domain single-channel speech separation. arXiv 2020, arXiv:1910.06379.
- Maldonado, A.; Rascon, C.; Velez, I. Lightweight online separation of the sound source of interest through BLSTM-based binary masking. arXiv 2020, arXiv:2002.11241.
- Li, K.; Yang, R.; Hu, X. An efficient encoder-decoder architecture with top-down attention for speech separation. arXiv 2023, arXiv:2209.15200.
- Tzinis, E.; Wang, Z.; Smaragdis, P. Sudo rm -rf: Efficient networks for universal audio source separation. arXiv 2020, arXiv:2007.06833.
- Tzinis, E.; Wang, Z.; Jiang, X.; Smaragdis, P. Compute and memory efficient universal sound source separation. J. Signal Process. Syst. 2022, 94, 245–259.
- Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. Manner: Multi-view attention network for noise erasure. arXiv 2022, arXiv:2203.02181.
- Ravenscroft, W.; Goetze, S.; Hain, T. On time domain conformer models for monaural speech separation in noisy reverberant acoustic environments. arXiv 2023, arXiv:2310.06125.
- Wichern, G.; Antognini, J.; Flynn, M.; Zhu, L.R.; McQuinn, E.; Crow, D.; Manilow, E.; Roux, J.L. Wham!: Extending speech separation to noisy environments. arXiv 2023, arXiv:1907.01160.
- Yu, D.; Kolbæk, M.; Tan, Z.; Jensen, J. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. arXiv 2017, arXiv:1607.00325.
- Available online: https://github.com/etzinis/sudo_rm_rf (accessed on 30 March 2024).
- Available online: https://github.com/winddori2002/MANNER (accessed on 30 March 2024).
- Available online: https://github.com/jwr1995/pubsep (accessed on 30 March 2024).
- Zhao, S.; Ma, B. MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. arXiv 2023, arXiv:2302.11824.
- Zhao, S.; Ma, Y.; Ni, C.; Zhang, C.; Wang, H.; Nguyen, T.H.; Zhou, K.; Yip, J.; Ng, D.; Ma, B. MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. arXiv 2024, arXiv:2312.11825.
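Model | SI-SDRi (dB) |
---|---|
SuDoRM-RF++ | 12.38 |
ESC-MASD-Net | 13.71 |
ESC-MASD-Net without MA block | 13.60 |
ESC-MASD-Net without ResCon block | 13.17 |

Model | Conformer layer position | SI-SDRi (dB) |
---|---|---|
ESC-MASD-Net | none | 13.71 |
ESC-MASD-Net with Conformer layer | (a) right before the U-Convblocks | 13.95 |
ESC-MASD-Net with Conformer layer | (b) at the bottom of the first U-Convblock | 13.80 |
ESC-MASD-Net with Conformer layer | (c) at the bottom of all four U-Convblocks | 13.90 |

Model | Conformer layer position | Epoch | SI-SDRi (dB) |
---|---|---|---|
SuDoRM-RF++ | none | 100 | 12.38 |
ESC-MASD-Net | none | 100 | 13.71 |
ESC-MASD-Net | none | 153 | 13.94 |
ESC-MASD-Net with a single conformer layer | (a) right before the U-Convblocks | 100 | 13.95 |
ESC-MASD-Net with a single conformer layer | (a) right before the U-Convblocks | 151 | 14.24 |
ESC-MASD-Net with a single conformer layer | (b) at the bottom of the first U-Convblock | 100 | 13.80 |
ESC-MASD-Net with a single conformer layer | (b) at the bottom of the first U-Convblock | 153 | 14.02 |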
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Aung, A.N.; Liao, C.-W.; Hung, J.-W. Effective Monoaural Speech Separation through Convolutional Top-Down Multi-View Network. Future Internet 2024, 16, 151. https://doi.org/10.3390/fi16050151