Training a Filter-Based Model of the Cochlea in the Context of Pre-Trained Acoustic Models
Abstract
1. Introduction
- Trainable filters can replace the encoder CNN in an already pre-trained model for fine-tuning.
- A physiologically adaptable front-end performs as well as a CNN in a pre-trained model.
- Trainable filters can be incorporated during self-supervision.
- When trained with a large transformer model, SincNet filters do not tend to become wide-band, as they do with a smaller MLP model.
2. Background
2.1. Self-Supervised Models
2.1.1. Foundations
2.1.2. Wav2vec
2.2. Cochlear Models
2.2.1. Filterbanks
2.2.2. Current Understanding
2.3. ASR with Trainable Filters
2.3.1. From Cochlear Models to E2E ASR
2.3.2. SincNet
2.4. Speech Features
2.5. Wide-Band Structures
3. Method
3.1. Overall Hypothesis
3.2. Pre-Trained Model
3.3. Dataset
4. Experiments
4.1. Can Trainable Filters Replace the Encoder CNN in an Already Pre-Trained Model?
4.1.1. Choice of Global Parameters
Learning Rate Schedulers
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first and second moments of the gradient, as defined in the Adam update reproduced below.
- $\epsilon$ is a small constant preventing a division by zero.
- $\eta$ is the learning rate.
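For reference, these symbols belong to the standard Adam update of Kingma and Ba, where $g_t$ denotes the gradient of the loss at step $t$ and $\beta_1$, $\beta_2$ are the moment decay rates:

```latex
% Adam update (Kingma & Ba); g_t is the gradient of the loss at step t
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t            % first-moment moving average
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2          % second-moment moving average
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}                  % bias correction
\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}   % parameter update
```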
SincNet Modifications
Kernel Size
Number of Updates
4.1.2. Results
4.2. Does a Physiologically Adapted Front-End Perform as Well as a CNN in a Pre-Trained Model?
4.2.1. Hypothesis
4.2.2. Results
4.3. Can Trainable Filters Be Incorporated during Self-Supervision?
4.3.1. Hypothesis
Self-Supervised Learning
Fine-Tuning
4.3.2. Results
4.4. Do Wide-Band Filters Appear in Some Other Training or Model Configurations?
4.4.1. The Hypothesis of Too Few Filters
Hypothesis
Results
4.4.2. The Transformers Precluding the Filters from Learning Wide-Band Structures
Hypothesis
Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Seide, F.; Li, G.; Yu, D. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In Proceedings of the Interspeech Conference, Florence, Italy, 27–31 August 2011; pp. 437–440.
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
- Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 1715–1725.
- Collobert, R.; Puhrsch, C.; Synnaeve, G. Wav2letter: An end-to-end convnet-based speech recognition system. arXiv 2016, arXiv:1609.03193.
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. Wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019, arXiv:1904.05862.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Parthasarathi, S.H.K.; Strom, N. Lessons from Building Acoustic Models with a Million Hours of Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6670–6674.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the NAACL-HLT: Demonstrations, Minneapolis, MN, USA, 2 June 2019.
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. arXiv 2021, arXiv:2111.09296.
- Dahl, G.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 30–42.
- Millet, J.; Caucheteux, C.; Orhan, P.; Boubenec, Y.; Gramfort, A.; Dunbar, E.; Pallier, C.; King, J.R. Toward a realistic model of speech processing in the brain with self-supervised learning. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 10–16 December 2022.
- Coppieters de Gibson, L.; Garner, P.N. Low-Level Physiological Implications of End-to-End Learning of Speech Recognition. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 749–753.
- Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5–9 July 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 160–167.
- Lample, G.; Conneau, A.; Denoyer, L.; Ranzato, M. Unsupervised machine translation using monolingual corpora only. arXiv 2017, arXiv:1711.00043.
- Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. arXiv 2019, arXiv:1910.05453.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 12449–12460.
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376.
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Von Békésy, G. Experiments in Hearing; McGraw-Hill: New York, NY, USA, 1960.
- Geisler, C.D. Mathematical Models of the Mechanics of the Inner Ear; Springer: Berlin/Heidelberg, Germany, 1976; pp. 391–415.
- Zwislocki, J. Review of recent mathematical theories of cochlear dynamics. J. Acoust. Soc. Am. 1953, 25, 743–751.
- Lyon, R.F. Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function. J. Acoust. Soc. Am. 2011, 130, 3893–3904.
- Lyon, R.F. Using a Cascade of Asymmetric Resonators with Fast-Acting Compression as a Cochlear Model for Machine-Hearing Applications. In Proceedings of the Autumn Meeting of the Acoustical Society of Japan; Acoustical Society of Japan: Tokyo, Japan, 2011; pp. 509–512.
- Lyon, R.F. Human and Machine Hearing: Extracting Meaning from Sound; Cambridge University Press: Cambridge, UK, 2017.
- Thakur, C.S.; Hamilton, T.J.; Tapson, J.; van Schaik, A.; Lyon, R.F. FPGA Implementation of the CAR Model of the Cochlea. In Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne, VIC, Australia, 1–5 June 2014; pp. 1853–1856.
- Islam, M.A.; Xu, Y.; Monk, T.; Afshar, S.; van Schaik, A. Noise-robust text-dependent speaker identification using cochlear models. J. Acoust. Soc. Am. 2022, 151, 500–516.
- Xu, Y.; Afshar, S.; Wang, R.; Cohen, G.; Singh Thakur, C.; Hamilton, T.J.; van Schaik, A. A biologically inspired sound localisation system using a silicon cochlea pair. Appl. Sci. 2021, 11, 1519.
- Pedersen, P. The mel scale. J. Music Theory 1965, 9, 295–308.
- Hermansky, H. Perceptual Linear Predictive (PLP) Analysis of Speech. J. Acoust. Soc. Am. 1990, 87, 1738–1752.
- Smith, J.O.; Abel, J.S. Bark and ERB bilinear transforms. IEEE Trans. Speech Audio Process. 1999, 7, 697–708.
- Moore, B.C.; Glasberg, B.R. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am. 1983, 74, 750–753.
- Zwicker, E. Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen). J. Acoust. Soc. Am. 1961, 33, 248.
- Sridhar, D.; Stakhovskaya, O.; Leake, P.A. A Frequency-Position Function for the Human Cochlear Spiral Ganglion. Audiol. Neurotol. 2006, 11, 16–20.
- De Boer, E. On active and passive cochlear models—Toward a generalized analysis. J. Acoust. Soc. Am. 1983, 73, 574–576.
- Hudspeth, A. Making an effort to listen: Mechanical amplification in the ear. Neuron 2008, 59, 530–545.
- Hudspeth, A.; Jülicher, F.; Martin, P. A critique of the critical cochlea: Hopf—a bifurcation—is better than none. J. Neurophysiol. 2010, 104, 1219–1229.
- Probst, R.; Lonsbury-Martin, B.L.; Martin, G.K. A review of otoacoustic emissions. J. Acoust. Soc. Am. 1991, 89, 2027–2067.
- Kemp, D.T. Otoacoustic emissions, their origin in cochlear function, and use. Br. Med. Bull. 2002, 63, 223–241.
- Hamilton, T.J.; Jin, C.; Tapson, J.; van Schaik, A. A 2-D cochlea with Hopf oscillators. In Proceedings of the IEEE Biomedical Circuits and Systems Conference, Montreal, QC, Canada, 27–30 November 2007; pp. 91–94.
- Hamilton, T.J.; Tapson, J.; Jin, C.; Van Schaik, A. Analogue VLSI implementations of two dimensional, nonlinear, active cochlea models. In Proceedings of the Biomedical Circuits and Systems Conference, Baltimore, MD, USA, 20–22 November 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 153–156.
- Ammari, H.; Davies, B. Mimicking the active cochlea with a fluid-coupled array of subwavelength Hopf resonators. Proc. R. Soc. A Math. Phys. Eng. Sci. 2020, 476, 20190870.
- Davis, S.B.; Mermelstein, P. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012.
- Palaz, D.; Collobert, R.; Doss, M.M. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. arXiv 2013, arXiv:1304.1018.
- Palaz, D.; Doss, M.M.; Collobert, R. Convolutional neural networks-based continuous speech recognition using raw speech signal. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4295–4299.
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
- Chakraborty, S.; Tomsett, R.; Raghavendra, R.; Harborne, D.; Alzantot, M.; Cerutti, F.; Srivastava, M.; Preece, A.; Julier, S.; Rao, R.M.; et al. Interpretability of deep learning models: A survey of results. In Proceedings of the IEEE Smartworld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, San Francisco, CA, USA, 4–8 August 2017; pp. 1–6.
- Sainath, T.; Weiss, R.J.; Wilson, K.; Senior, A.W.; Vinyals, O. Learning the speech front-end with raw waveform CLDNNs. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015.
- López-Espejo, I.; Tan, Z.H.; Jensen, J. Exploring Filterbank Learning for Keyword Spotting. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 331–335.
- Zeghidour, N.; Usunier, N.; Kokkinos, I.; Schatz, T.; Synnaeve, G.; Dupoux, E. Learning filterbanks from raw speech for phone recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5509–5513.
- Noé, P.G.; Parcollet, T.; Morchid, M. Cgcnn: Complex gabor convolutional neural network on raw speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7724–7728.
- Zeghidour, N.; Teboul, O.; Quitry, F.d.C.; Tagliasacchi, M. LEAF: A learnable frontend for audio classification. arXiv 2021, arXiv:2101.08596.
- Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with sincnet. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 1021–1028.
- Ravanelli, M.; Bengio, Y. Interpretable convolutional filters with sincnet. arXiv 2018, arXiv:1811.09725.
- Balestriero, R.; Cosentino, R.; Glotin, H.; Baraniuk, R. Spline filters for end-to-end deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 364–373.
- Parcollet, T.; Morchid, M.; Linares, G. E2E-SINCNET: Toward fully end-to-end speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7714–7718.
- Olive, J.P.; Greenwood, A.; Coleman, J. Acoustics of American English Speech: A Dynamic Approach; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1993.
- Ravanelli, M.; Parcollet, T.; Bengio, Y. The PyTorch-Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6465–6469.
- Lomakin, O.; Davis, K.A. On the role of the wideband inhibitor in the dorsal cochlear nucleus: A computational modeling study. J. Assoc. Res. Otolaryngol. 2008, 9, 506–520.
- Biebel, U.W.; Langner, G. Evidence for interactions across frequency channels in the inferior colliculus of awake chinchilla. Hear. Res. 2002, 169, 151–168.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Subset | Duration [h] | Min./Speaker | Female | Male | Total Speakers | Used for
---|---|---|---|---|---|---
dev-clean | 5.4 | 8 | 20 | 20 | 40 | Evaluation
dev-other | 5.3 | 10 | 16 | 17 | 33 | Validation
train-100 | 100.1 | 25 | 125 | 126 | 251 | SS and FT
train-360 | 263.6 | 25 | 439 | 482 | 921 | SS
train-500 | 496.7 | 30 | 564 | 602 | 1166 | SS

(SS = self-supervised pre-training; FT = fine-tuning.)
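To make the subset names concrete, here is a minimal sketch of loading the corresponding LibriSpeech splits with torchaudio. The mapping from the table's short names to torchaudio split identifiers is our assumption based on the reported durations, not something stated by the authors:

```python
import torchaudio

# Table short names mapped to LibriSpeech split identifiers
# (assumed from the reported durations: 100.1 h -> train-clean-100, etc.)
SPLITS = {
    "dev-clean": "dev-clean",        # evaluation
    "dev-other": "dev-other",        # validation
    "train-100": "train-clean-100",  # self-supervision and fine-tuning
    "train-360": "train-clean-360",  # self-supervision
    "train-500": "train-other-500",  # self-supervision
}

def load_subset(name: str, root: str = "./data"):
    """Download (if needed) and return one split; each item is
    (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)."""
    return torchaudio.datasets.LIBRISPEECH(root, url=SPLITS[name], download=True)

dev_clean = load_subset("dev-clean")
waveform, sample_rate, transcript, *_ = dev_clean[0]
```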
n_filters | Maxpooling | Kernel Size | n_updates | WER [%]
---|---|---|---|---|
40 | - | 400 | 100 k | 3.31 |
40 | - | 400 | 10 k | 3.53 |
40 | 3 | 400 | 10 k | 3.64 |
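As a reminder of what these hyperparameters parameterize, below is a minimal sketch of a SincNet-style layer in the spirit of Ravanelli and Bengio: each band-pass filter is the difference of two windowed sinc low-pass filters, so only the two cut-off frequencies per filter are learned, and the kernel size sets the length of the impulse response. This is an illustrative re-implementation, not the authors' exact code; the initialization values are placeholders.

```python
import torch
import torch.nn.functional as F

class SincFilters(torch.nn.Module):
    """SincNet-style filterbank: each band-pass kernel is the difference of
    two windowed sinc low-pass kernels, parameterized only by its two
    learnable cut-off frequencies (illustrative sketch)."""

    def __init__(self, n_filters=40, kernel_size=400, sample_rate=16000):
        super().__init__()
        # Learnable lower cut-off and bandwidth per filter, in Hz (linear init here)
        self.low_hz = torch.nn.Parameter(torch.linspace(30.0, 7000.0, n_filters))
        self.band_hz = torch.nn.Parameter(torch.full((n_filters,), 100.0))
        # Fixed symmetric time axis (seconds) and Hamming window
        n = (torch.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):                      # x: (batch, 1, samples)
        f1 = self.low_hz.abs()                 # lower cut-off (Hz)
        f2 = f1 + self.band_hz.abs()           # upper cut-off (Hz)
        # Ideal low-pass impulse response 2f*sinc(2fn); torch.sinc(t) = sin(pi t)/(pi t)
        lp = lambda f: 2 * f[:, None] * torch.sinc(2 * f[:, None] * self.n)
        kernels = (lp(f2) - lp(f1)) * self.window    # (n_filters, kernel_size)
        return F.conv1d(x, kernels.unsqueeze(1))     # (batch, n_filters, frames)

# Matching the first table row: 40 filters with a kernel size of 400
layer = SincFilters(n_filters=40, kernel_size=400)
out = layer(torch.randn(2, 1, 16000))  # a maxpooling of 3 would follow here
```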
Model | Training Time | No. of Parameters
---|---|---|
Small CNN | 7 h 14 min | 0.575 M |
Large CNN | 9 h 48 min | 4.406 M |
Relearn CNN (w2v2 shape) | 11 h 57 min | 4.206 M |
Relearn CNN (SN shape) | 14 h 7 min | 4.422 M |
Model | Train Loss | Valid Loss | WER [%]
---|---|---|---
Relearn CNN (w2v2 shape) | 129.5 | 28.67 | 3.35 ± 0.15
Relearn CNN (SN shape) | 124.7 | 28.96 | 3.45 ± 0.15 |
SincNet (large CNN) | 120.8 | 28.48 | 3.33 ± 0.15 |
SincNet (small CNN) | 122.3 | 30.26 | 3.53 ± 0.15 |
Model | Number of Updates | Train Acc. [%] | Valid Acc. [%] | Train Loss | Valid Loss
---|---|---|---|---|---
Baseline | 0 | 61.1 | 64.6 | 2.10 | 1.96 |
Frozen filters | 10 k | 52.2 | 62.2 | 2.49 | 2.13 |
Trained filters | 10 k | 56.9 | 63.0 | 2.34 | 2.06 |
Frozen filters | 100 k | 60.5 | 66.5 | 2.13 | 1.87 |
Trained filters | 100 k | 61.4 | 67.2 | 2.09 | 1.84 |
Pre-Training Phase | Fine-Tuning Phase | WER [%] |
---|---|---|
Frozen filters | Trainable filters | 3.40 |
Trainable filters | Trainable filters | 3.37 |
n_filters | Maxpooling | Kernel Size | WER [%]
---|---|---|---|
40 | 3 | 400 | 3.64 |
60 | 3 | 400 | 3.61 |
80 | 3 | 400 | 3.57 |
100 | 3 | 400 | 3.56 |