Attention-Based Joint Training of Noise Suppression and Sound Event Detection for Noise-Robust Classification
Abstract
1. Introduction
2. Proposed System
2.1. Noise Suppression Model
2.1.1. Deep Feature Loss
2.1.2. Auxiliary Model
2.2. Sound Event Detection Model
2.3. Joint Training
2.4. Attention Mechanism
3. Experiment
3.1. Dataset
3.2. Metrics
3.3. Feature Extraction for SED Model
3.4. Baseline Model
3.5. Training Details and Evaluation
3.5.1. Noise Suppression Model
3.5.2. Auxiliary Model and Sound Event Detection Model
3.5.3. Joint Training Model
3.5.4. Evaluation
4. Results and Discussion
4.1. Simulation Results
4.2. Real Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Hyperparameter | Notation |
|---|---|
| Number of encoder filters | N |
| Length of the filters | L |
| Number of channels in convolutional blocks | H |
| Kernel size in convolutional blocks | P |
| Number of convolutional blocks in each repeat | X |
| Number of repeats | R |
| Number of groups | K |
| Group size | M |
| Context size (in frames) | C |
| TCN block size (in frames) | B |
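To make the notation concrete, the sketch below collects these hyperparameters into a single configuration object. The class name and the default values are hypothetical placeholders for illustration, not the settings reported in the paper.

```python
from dataclasses import dataclass

# Minimal configuration sketch for a GC3-TCN-style noise suppression model.
# Field names follow the notation in the table above; the default values
# are hypothetical placeholders, not the paper's actual settings.
@dataclass
class GC3TCNConfig:
    N: int = 512   # number of encoder filters
    L: int = 16    # length of the filters (in samples)
    H: int = 512   # number of channels in convolutional blocks
    P: int = 3     # kernel size in convolutional blocks
    X: int = 8     # number of convolutional blocks in each repeat
    R: int = 3     # number of repeats
    K: int = 8     # number of groups for group communication
    M: int = 64    # group size
    C: int = 32    # context size (in frames)
    B: int = 100   # TCN block size (in frames)

config = GC3TCNConfig()
print(config.N, config.R)  # 512 3
```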
| Layer | Output Size |
|---|---|
| Input | 1 × 128 × T |
| Convolution 1 | 16 × 64 × T |
| Max-pooling 1 | 16 × 32 × T |
| Convolution 2 | 32 × 16 × T |
| Max-pooling 2 | 32 × 8 × T |
| Convolution 3 | 64 × 4 × T |
| Max-pooling 3 | 64 × 2 × T |
| Reshape | T × 128 |
| BLSTM × 3 | T × 128 |
| Fully connected | T × 128 |
| Output | T × 11 |
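As a cross-check of the shapes above, the following PyTorch sketch reproduces the layer table exactly: each convolution halves the frequency axis with stride 2 and each max-pooling halves it again, so 128 mel bins collapse to a 128-dimensional per-frame embedding. Kernel sizes, strides, activations, and the sigmoid output are assumptions, since the table lists only output sizes.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal sketch of the SED model following the layer table above.

    Shapes follow the table (channels x mel bins x frames); kernel sizes,
    strides, and activations are assumptions.
    """

    def __init__(self, n_classes: int = 11):
        super().__init__()
        # Each conv halves the frequency axis (stride 2) and each max-pool
        # halves it again, matching 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2.
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=(2, 1), padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),
            )

        self.cnn = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        # 64 channels x 2 mel bins flatten to a 128-dim frame embedding.
        self.rnn = nn.LSTM(128, 64, num_layers=3,
                           batch_first=True, bidirectional=True)  # BLSTM x 3
        self.fc = nn.Linear(128, 128)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):           # x: (batch, 1, 128, T)
        h = self.cnn(x)             # (batch, 64, 2, T)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, T, 128)
        h, _ = self.rnn(h)          # (batch, T, 128)
        h = torch.relu(self.fc(h))  # (batch, T, 128)
        return torch.sigmoid(self.out(h))     # (batch, T, 11) frame-wise activity

x = torch.randn(2, 1, 128, 100)
print(CRNN()(x).shape)  # torch.Size([2, 100, 11])
```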
| Method | NS Model | SED Model | F-Score (%) | Error Rate |
|---|---|---|---|---|
| Baseline | - | CRNN | 33.7 | 0.78 |
| Before JT (pretrained) | GC3-TCN | CRNN | 31.4 | 0.81 |
| After JT (fine-tuned) | GC3-TCN | CRNN w/o attention (w/o freeze) | 42.4 | 0.70 |
| | GC3-TCN | CRNN w/o attention (freeze dense) | 43.6 | 0.68 |
| | GC3-TCN | CRNN w/o attention (+ CNN) | 38.0 | 0.74 |
| | GC3-TCN | CRNN w/o attention (+ RNN) | 34.3 | 0.78 |
| | GC3-TCN | CRNN w/ attention (+ freeze dense) | 45.2 | 0.66 |
| Method | NS Model | SED Model | F-Score (%) | Error Rate |
|---|---|---|---|---|
| Baseline | - | CRNN | 41.9 | 0.71 |
| Before JT (pretrained) | GC3-TCN | CRNN | 36.0 | 0.75 |
| After JT (fine-tuned) | GC3-TCN | CRNN w/o attention (w/o freeze) | 48.7 | 0.66 |
| | GC3-TCN | CRNN w/o attention (freeze dense) | 50.5 | 0.64 |
| | GC3-TCN | CRNN w/o attention (+ CNN) | 47.8 | 0.66 |
| | GC3-TCN | CRNN w/o attention (+ RNN) | 44.6 | 0.69 |
| | GC3-TCN | CRNN w/ attention (+ freeze dense) | 51.9 | 0.62 |
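Both tables report the standard segment-based polyphonic SED metrics: the F-score combines precision and recall over active segments, and the error rate counts substitutions, deletions, and insertions relative to the number of active reference events. The sketch below is a rough illustration of those definitions, not the evaluation code used in the paper; toolkits such as sed_eval implement the official versions.

```python
import numpy as np

def segment_metrics(ref: np.ndarray, est: np.ndarray):
    """Segment-based F-score and error rate for polyphonic SED.

    ref, est: binary activity matrices of shape (n_segments, n_classes).
    Illustrative only; the paper's exact evaluation setup (segment
    length, thresholds) is not reproduced here.
    """
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum(axis=1)  # per-segment false positives
    fn = np.logical_and(ref == 1, est == 0).sum(axis=1)  # per-segment false negatives

    # Per-segment substitutions, deletions, and insertions.
    s = np.minimum(fp, fn).sum()
    d = np.maximum(0, fn - fp).sum()
    i = np.maximum(0, fp - fn).sum()
    n = ref.sum()  # total number of active reference events

    f_score = 2 * tp / (2 * tp + fp.sum() + fn.sum())
    error_rate = (s + d + i) / n
    return f_score, error_rate

# Toy example: 3 segments, 2 event classes.
ref = np.array([[1, 0], [1, 1], [0, 1]])
est = np.array([[1, 0], [0, 1], [1, 1]])
print(segment_metrics(ref, est))  # (0.75, 0.5)
```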