High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
Abstract
1. Introduction
- Sub-spectrogram segmentation: The spectrogram of an environmental sound deserves closer, band-by-band study, because the low-frequency part of the spectrum usually carries richer information, as explained in [10]. Although the straightforward sub-spectrogram segmentation proposed in [20] is shown to improve acoustic scene classification accuracy, its extension to ESC tasks remains open. In addition, according to the existing literature, the number of sub-spectrogram segments and the truncation rules still need to be optimized (a minimal segmentation sketch is given after this list);
- Attention mechanism: Another possible way to improve ESC performance is to incorporate a human-like attention mechanism [21,22,23,24,25] into the convolutional feature layers, using temporal [24], frequency [26], or channel [27] domain information, or a hybrid of them [27]. However, the previous joint attention scheme [27] focuses on combining temporal and channel knowledge without considering frequency-domain characteristics, so the joint time-frequency feature is not fully exploited. As shown later, with joint time- and frequency-domain attention, the ESC accuracy can be greatly improved (an illustrative attention sketch follows this list);
- Recurrent architecture with data augmentation: Consecutive frames of environmental sound are strongly correlated in the time domain, which makes prediction via a recurrent architecture possible. As shown in [28], exploiting the correlations among sequences at different scales can also improve classification accuracy. However, this approach usually requires a large amount of training data, which conflicts with the limited size of available datasets. It is therefore necessary to jointly consider effective dataset-expansion methods such as mixup [29] and SpecAugment [30] (a data augmentation sketch follows this list).
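To make the segmentation step concrete, the following minimal NumPy sketch (ours, not the authors' code) splits a spectrogram into sub-spectrograms at given cut-off frequencies. It assumes a linearly spaced frequency axis for simplicity; the Gammatone bins used in this paper are ERB-spaced, so the actual bin boundaries would differ.

```python
import numpy as np

def segment_spectrogram(spec, cutoffs_khz, f_max_khz=22.05):
    """Split a (freq_bins x frames) spectrogram into sub-spectrograms.
    cutoffs_khz lists the interior cut-off frequencies, e.g. [3, 6, 10],
    giving the bands [0, 3), [3, 6), [6, 10) and [10, 22.05] kHz.
    A linear frequency axis is assumed here for illustration only."""
    n_bins = spec.shape[0]
    edges = [0] + [int(round(c / f_max_khz * n_bins)) for c in cutoffs_khz] + [n_bins]
    return [spec[lo:hi, :] for lo, hi in zip(edges[:-1], edges[1:])]

# Example: a 128-bin x 128-frame spectrogram split at 3, 6 and 10 kHz.
spec = np.random.randn(128, 128)
print([s.shape for s in segment_spectrogram(spec, [3, 6, 10])])
```

Each sub-spectrogram is then classified by its own network, and the per-band scores are combined by the score-level fusion of Section 4.3.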
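The temporal-frequency attention idea can be illustrated with the following hypothetical PyTorch gate, which pools the convolutional feature map along the time and frequency axes, derives per-axis weights, and rescales the map. It is only a sketch of the general mechanism; the paper's TFAM combines max, average, and convolutional pooling and differs in its details.

```python
import torch
import torch.nn as nn

class TemporalFrequencyAttention(nn.Module):
    """Illustrative temporal-frequency attention gate (hypothetical, not the
    paper's exact TFAM): pool along each axis, derive weights with a 1-D
    convolution and a sigmoid, then rescale the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.freq_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.time_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                       # x: (B, C, F, T)
        freq_w = torch.sigmoid(self.freq_conv(x.mean(dim=3)))   # (B, C, F)
        time_w = torch.sigmoid(self.time_conv(x.mean(dim=2)))   # (B, C, T)
        return x * freq_w.unsqueeze(3) * time_w.unsqueeze(2)

x = torch.randn(4, 32, 128, 128)
print(TemporalFrequencyAttention(32)(x).shape)  # torch.Size([4, 32, 128, 128])
```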
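For the data augmentation side, the sketch below shows mixup with the Beta hyper-parameter 0.2 used in the experiments, together with a simplified SpecAugment-style masking whose mask widths are chosen arbitrarily here for illustration.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup [29]: blend two examples and their one-hot labels with a
    Beta(alpha, alpha) weight (alpha = 0.2 in the experiments below)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def spec_augment(spec, max_f=16, max_t=16):
    """SpecAugment-style masking [30] (simplified; the mask widths are
    illustrative): zero one random frequency band and one random time band."""
    spec = spec.copy()
    f0 = np.random.randint(0, spec.shape[0] - max_f)
    t0 = np.random.randint(0, spec.shape[1] - max_t)
    spec[f0:f0 + np.random.randint(1, max_f + 1), :] = 0.0
    spec[:, t0:t0 + np.random.randint(1, max_t + 1)] = 0.0
    return spec
```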
2. Preliminary
2.1. Log Gammatone Spectrogram
2.2. Deep Neural Networks
3. Overview of the Proposed High Accurate ESC
3.1. Overview
3.2. Sub-Spectrogram Segmentation
3.3. Temporal-Frequency Attention
4. Proposed Sub-Spectrogram Segmentation Based Classification Framework
4.1. Sub-Spectrogram Segmentation
4.2. CRNN with Mixup
4.3. Score Level Fusion
5. Proposed Temporal-Frequency Attention Based Classification Framework
5.1. Attention Map Generation
5.2. Data Augmentation Schemes
6. Experiments
6.1. Experiment Setup
6.2. Effect of Sub-Spectrogram Segmentation
6.3. Accuracy under Sub-Spectrogram Segmentation Based Classification Framework
6.4. Accuracy under Temporal-Frequency Attention Based Classification Framework
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Barchiesi, D.; Giannoulis, D.; Stowell, D.; Plumbley, M.D. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Process. Mag. 2015, 32, 16–34.
- Zatorre, R.J.; Belin, P.; Penhune, V.B. Structure and function of auditory cortex: Music and speech. Trends Cogn. Sci. 2002, 6, 37–46.
- Chachada, S.; Kuo, C.C.J. Environmental sound recognition: A survey. APSIPA Trans. Signal Inf. Process. 2014, 3, e14.
- Chu, S.; Narayanan, S.; Kuo, C.C.J. Environmental sound recognition with time–frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158.
- Cowling, M.; Sitte, R. Comparison of techniques for environmental sound recognition. Pattern Recogn. Lett. 2003, 24, 2895–2907.
- Jalil, M.; Butt, F.A.; Malik, A. Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals. In Proceedings of the International Conference on TAEECE, Konya, Turkey, 9–11 May 2013; pp. 208–212.
- Dennis, J.W. Sound Event Recognition in Unstructured Environments Using Spectrogram Image Processing; Nanyang Technological University: Singapore, 2014.
- Rabiner, L.R.; Juang, B.H.; Rutledge, J.C. Fundamentals of Speech Recognition; Prentice-Hall: Hoboken, NJ, USA, 1993; Volume 14.
- Piczak, K.J. Environmental sound classification with convolutional neural networks. In Proceedings of the 25th International Workshop on Machine Learning for Signal Processing, Boston, MA, USA, 17–20 September 2015; pp. 1–6.
- Valero, X.; Alias, F. Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Trans. Multimedia 2012, 14, 1684–1689.
- Keller, J.M.; Gray, M.R.; Givens, J.A. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 1985, 4, 580–585.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Scholkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2001.
- Atrey, P.K.; Maddage, N.C.; Kankanhalli, M.S. Audio based event detection for multimedia surveillance. In Proceedings of the 2006 International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 14–19 May 2006; Volume 5, p. V.
- Bisot, V.; Serizel, R.; Essid, S.; Richard, G. Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1216–1229.
- Tokozume, Y.; Ushiku, Y.; Harada, T. Learning from between-class examples for deep sound recognition. arXiv 2018, arXiv:1711.10282.
- McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W. Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 540–552.
- Boddapati, V.; Petef, A.; Rasmusson, J.; Lundberg, L. Classifying environmental sounds using image recognition networks. Procedia Comput. Sci. 2017, 112, 2048–2056.
- Zhang, X.; Zou, Y.; Shi, W. Dilated convolution neural network with LeakyReLU for environmental sound classification. In Proceedings of the 22nd International Conference on Digital Signal Processing, London, UK, 23–25 August 2017; pp. 1–5.
- Phaye, S.S.R.; Benetos, E.; Wang, Y. SubSpectralNet: Using Sub-Spectrogram Based Convolutional Neural Networks for Acoustic Scene Classification. arXiv 2018, arXiv:1810.12642.
- Guo, J.; Xu, N.; Li, L.J.; Alwan, A. Attention Based CLDNNs for Short-Duration Acoustic Scene Classification. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 469–473.
- Jun, W.; Shengchen, L. Self-Attention Mechanism Based System for Dcase2018 Challenge Task1 and Task4. In Proceedings of the DCASE Challenge, Surrey, UK, 19–20 November 2018.
- Zhang, Z.; Xu, S.; Qiao, T.; Zhang, S.; Cao, S. Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification. arXiv 2019, arXiv:1907.02230.
- Li, X.; Chebiyyam, V.; Kirchhoff, K. Multi-stream Network with Temporal Attention For Environmental Sound Classification. arXiv 2019, arXiv:1901.08608.
- Ren, Z.; Kong, Q.; Qian, K.; Plumbley, M.D.; Schuller, B. Attention-Based Convolutional Neural Networks for Acoustic Scene Classification. In Proceedings of the DCASE Challenge, Surrey, UK, 19–20 November 2018.
- Wang, H.; Zou, Y.; Chong, D.; Wang, W. Learning discriminative and robust time-frequency representations for environmental sound classification. arXiv 2019, arXiv:1912.06808.
- Zhang, Z.; Xu, S.; Zhang, S.; Qiao, T.; Cao, S. Learning Attentive Representations for Environmental Sound Classification. IEEE Access 2019, 7, 130327–130339.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Zhang, Z.; Xu, S.; Cao, S.; Zhang, S. Deep convolutional neural network with mixup for environmental sound classification. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Guangzhou, China, 23–26 November 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 356–367.
- Suh, S.; Lim, W.; Park, S.; Jeong, Y. Acoustic Scene Classification Using SpecAugment and Convolutional Neural Network with Inception Modules. In Proceedings of the DCASE2019 Challenge, New York, NY, USA, 25–26 October 2019.
- Qiao, T.; Zhang, S.; Zhang, Z.; Cao, S.; Xu, S. Sub-Spectrogram Segmentation for Environmental Sound Classification via Convolutional Recurrent Neural Network and Score Level Fusion. In Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS), Nanjing, China, 20–23 October 2019; pp. 318–323.
- Van Loan, C.F. The ubiquitous Kronecker product. J. Comput. Appl. Math. 2000, 123, 85–100.
- Piczak, K.J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1015–1018.
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
- Wang, H.; Zou, Y.; Chong, D.; Wang, W. Environmental Sound Classification with Parallel Temporal-spectral Attention. arXiv 2020, arXiv:1912.06808v3.
- Phan, H.; Chén, O.Y.; Pham, L.; Koch, P.; De Vos, M.; McLoughlin, I.; Mertins, A. Spatio-temporal attention pooling for audio scene classification. arXiv 2019, arXiv:1904.03543.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779.
Number of Segments | Start Frequency (kHz) | Cut-Off Frequencies (kHz) | End Frequency (kHz) | Accuracy
---|---|---|---|---
1 | 0 | - | 22.05 | 77.9% |
2 | 0 | 10 | 22.05 | 79.9% |
3 | 0 | 6, 10 | 22.05 | 81.7% |
4 | 0 | 3, 6, 10 | 22.05 | 82.1% |
5 | 0 | 3, 6, 10, 15 | 22.05 | 81.8% |
6 | 0 | 3, 6, 10, 13, 16 | 22.05 | 81.3% |
Layer | Number of Filters | Filter Size | Stride | Output Size
---|---|---|---|---
Conv1 | 32 | (3, 3) | (1, 1) | (128, 128, 32)
Conv2 | 32 | (3, 3) | (1, 1) | (128, 128, 32)
Pool1 | - | - | (4, 2) | (32, 64, 32)
Conv3 | 64 | (3, 1) | (1, 1) | (32, 64, 64)
Conv4 | 64 | (3, 1) | (1, 1) | (32, 64, 64)
Pool2 | - | - | (2, 1) | (16, 64, 64)
Conv5 | 128 | (1, 3) | (1, 1) | (16, 64, 128)
Conv6 | 128 | (1, 3) | (1, 1) | (16, 64, 128)
Pool3 | - | - | (1, 2) | (16, 32, 128)
Conv7 | 256 | (3, 3) | (1, 1) | (16, 32, 256)
Conv8 | 256 | (3, 3) | (1, 1) | (16, 32, 256)
Pool4 | - | - | (2, 2) | (8, 16, 256)
GRU1 | 256 | - | - | (16, 256)
GRU2 | 256 | - | - | (16, 256)
FC1 | number of classes | - | - | (number of classes)
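As a reading aid for the layer table above, the following PyTorch sketch builds a network with the same output shapes (the Keras-style (H, W, C) sizes translate to PyTorch (C, H, W)). The activations, the pooling kernel sizes (taken equal to the listed strides), and the last-time-step readout are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of the CRNN in the table above: 128 x 128 single-channel
    spectrogram in, class scores out. Details beyond the listed shapes
    (ReLU, pooling kernels, readout) are assumptions."""
    def __init__(self, num_classes=50):
        super().__init__()
        def block(cin, cout, k, pool):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding="same"), nn.ReLU(),
                nn.Conv2d(cout, cout, k, padding="same"), nn.ReLU(),
                nn.MaxPool2d(pool, stride=pool))
        self.features = nn.Sequential(
            block(1, 32, (3, 3), (4, 2)),     # Conv1-2 + Pool1 -> (32, 32, 64)
            block(32, 64, (3, 1), (2, 1)),    # Conv3-4 + Pool2 -> (64, 16, 64)
            block(64, 128, (1, 3), (1, 2)),   # Conv5-6 + Pool3 -> (128, 16, 32)
            block(128, 256, (3, 3), (2, 2)))  # Conv7-8 + Pool4 -> (256, 8, 16)
        self.gru = nn.GRU(input_size=256 * 8, hidden_size=256,
                          num_layers=2, batch_first=True)    # GRU1, GRU2
        self.fc = nn.Linear(256, num_classes)                 # FC1

    def forward(self, x):                       # x: (B, 1, 128, 128)
        f = self.features(x)                    # (B, 256, 8, 16)
        seq = f.permute(0, 3, 1, 2).flatten(2)  # (B, 16, 2048): 16-step sequence
        out, _ = self.gru(seq)                  # (B, 16, 256)
        return self.fc(out[:, -1])              # (B, num_classes)

print(CRNN()(torch.randn(2, 1, 128, 128)).shape)  # torch.Size([2, 50])
```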
Network | Mixup | Accuracy |
---|---|---|
CNN | × | 72.9% |
CRNN | × | 75.2% |
CNN | √ | 76.1% |
CRNN | √ | 77.9% |
Number of Segments | Start Frequency (kHz) | Cut-Off Frequencies (kHz) | End Frequency (kHz) | Fusion | Accuracy
---|---|---|---|---|---
2 | 0 | 10 | 22.05 | × | 76.2% |
2 | 0 | 10 | 22.05 | √ | 79.9% |
3 | 0 | 6, 10 | 22.05 | × | 78.1% |
3 | 0 | 6, 10 | 22.05 | √ | 81.7% |
4 | 0 | 3, 6, 10 | 22.05 | × | 79.8% |
4 | 0 | 3, 6, 10 | 22.05 | √ | 82.1% |
5 | 0 | 3, 6, 10, 15 | 22.05 | × | 77.9% |
5 | 0 | 3, 6, 10, 15 | 22.05 | √ | 81.8% |
Network | Max Pooling | Avg Pooling | Conv | Accuracy |
---|---|---|---|---|
CRNN | √ | × | × | 82.0% |
CRNN | × | √ | × | 81.8% |
CRNN | × | × | √ | 82.1% |
CRNN | √ | √ | √ | 82.3% |
Network | Pool1 | Pool2 | Pool3 | Pool4 | Accuracy
---|---|---|---|---|---
CRNN | √ | × | × | × | 82.3% |
CRNN | × | √ | × | × | 82.0% |
CRNN | × | × | √ | × | 81.9% |
CRNN | × | × | × | √ | 81.9% |
CRNN | √ | √ | × | × | 82.0% |
CRNN | √ | × | √ | × | 82.6% |
CRNN | √ | × | × | √ | 82.0% |
CRNN | × | √ | √ | × | 81.9% |
CRNN | × | √ | × | √ | 82.6% |
CRNN | × | × | √ | √ | 82.1% |
CRNN | √ | √ | √ | × | 82.2% |
CRNN | √ | √ | × | √ | 82.0% |
CRNN | √ | × | √ | √ | 81.6% |
CRNN | × | √ | √ | √ | 82.7% |
CRNN | √ | √ | √ | √ | 83.1% |
Network | Mixup | SpecAugment | TFAM | Accuracy |
---|---|---|---|---|
CRNN | × | × | × | 75.2% |
CRNN | √ | × | √ | 83.1% |
CRNN | × | √ | √ | 82.7% |
CRNN | √ | √ | √ | 86.4% |
Parameter | Definition | Value
---|---|---
  | sampling frequency | 44,100
  | number of classes | 50
T | STFT points | 1024
N | frame length | 128
K | number of Gammatone filter banks | 128
  | mixup hyper-parameter | 0.2
Number of Segments | Start Frequency (kHz) | Cut-Off Frequencies (kHz) | End Frequency (kHz) | Fusion Weights | Accuracy
---|---|---|---|---|---
1 | 0 | - | 22.05 | 1 | 77.9% |
2 | 0 | 10 | 22.05 | 0.7, 0.3 | 79.9% |
3 | 0 | 10, 20 | 22.05 | 0.5, 0.3, 0.2 | 80.2% |
3 | 0 | 7, 14 | 22.05 | 0.5, 0.2, 0.3 | 80.6% |
3 | 0 | 6, 10 | 22.05 | 0.5, 0.3, 0.2 | 81.7% |
4 | 0 | 10, 15, 20 | 22.05 | 0.5, 0.2, 0.2, 0.1 | 80.3% |
4 | 0 | 5, 10, 15 | 22.05 | 0.4, 0.3, 0.1, 0.2 | 80.9% |
4 | 0 | 3, 6, 10 | 22.05 | 0.4, 0.2, 0.2, 0.2 | 82.1% |
5 | 0 | 10, 13, 16, 19 | 22.05 | 0.4, 0.2, 0.1, 0.2, 0.1 | 81.0% |
5 | 0 | 5, 10, 15, 20 | 22.05 | 0.4, 0.2, 0.1, 0.2, 0.1 | 80.7% |
5 | 0 | 3, 6, 10, 15 | 22.05 | 0.4, 0.2, 0.2, 0.1, 0.1 | 81.8% |
6 | 0 | 3, 6, 10, 13, 16 | 22.05 | 0.3, 0.2, 0.2, 0.1, 0.1, 0.1 | 81.3% |
6 | 0 | 6, 10, 13, 16, 19 | 22.05 | 0.4, 0.1, 0.2, 0.1, 0.1, 0.1 | 81.2% |
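The fusion weights listed in the last table combine the per-segment classifier outputs at the score level. A minimal sketch of such weighted score-level fusion (assuming each sub-spectrogram classifier emits a softmax score vector; the exact fusion rule used by the authors may differ) is:

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted score-level fusion: combine per-segment class-score vectors
    and predict the class with the highest fused score."""
    fused = sum(w * np.asarray(s) for w, s in zip(weights, score_list))
    return fused, int(np.argmax(fused))

# Hypothetical example: four sub-spectrogram classifiers over 50 classes,
# fused with the weights 0.4, 0.2, 0.2, 0.2 of the best 4-segment setting.
scores = [np.random.dirichlet(np.ones(50)) for _ in range(4)]
fused, label = fuse_scores(scores, [0.4, 0.2, 0.2, 0.2])
print(label, fused.sum())
```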
Network | Mixup | Segmentation | Fusion | Accuracy |
---|---|---|---|---|
CNN | × | × | × | 72.9% |
CRNN | × | × | × | 75.2% |
CNN | √ | × | × | 76.1% |
CRNN | √ | × | × | 77.9% |
CNN | × | √ | × | 76.1% |
CNN | × | √ | √ | 77.6% |
CRNN | × | √ | × | 78.1% |
CRNN | × | √ | √ | 79.5% |
CNN | √ | √ | × | 78.2% |
CNN | √ | √ | √ | 80.8% |
CRNN | √ | √ | × | 79.8% |
CRNN | √ | √ | √ | 82.1% |
Network | Mixup | SpecAugment | Segmentation | TFAM | Accuracy |
---|---|---|---|---|---|
CNN | × | × | × | × | 72.9% |
CRNN | × | × | × | × | 75.2% |
CRNN | √ | × | × | × | 77.9% |
CRNN | × | √ | × | × | 77.5% |
CRNN | × | × | √ | × | 79.5% |
CRNN | × | × | × | √ | 80.3% |
CRNN | √ | × | √ | × | 82.1% |
CRNN | √ | × | × | √ | 83.1% |
CRNN | × | √ | × | √ | 82.7% |
CRNN | √ | √ | × | √ | 86.4% |
Method | Accuracy
---|---
PiczakCNN | 64.9% |
GoogLeNet | 67.8%
SoundNet | 74.2% |
AlexNet | 78.7% |
WaveMsNet | 79.1% |
ProCNN | 82.8% |
Multi-Stream CNN | 83.5% |
EnvNet-v2 | 84.9% |
ACRNN | 86.1% |
TFAM | 86.4% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).