Speaker Recognition Based on the Joint Loss Function
Abstract
1. Introduction
- We propose a speaker recognition model trained with a joint loss function. During training, a cross-entropy loss and a contrastive learning loss act together, accounting for both the differences between different speakers and the similarities within the same speaker, so the model learns more discriminative speaker-specific features.
- To better leverage contrastive learning, we improve the SPD-TDNN module by adjusting the position of the BN layer so that the output of the activation function is normalized by BN. This better preserves the input's dynamic range, enhancing the model's nonlinear expressive power and generalization ability.
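The joint objective described above can be sketched numerically. The snippet below is a minimal NumPy illustration of combining an AAM-Softmax classification term with an InfoNCE contrastive term; the function names, default margin/scale/temperature values, and the weighting factor are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def aam_softmax_loss(cos_sim, labels, margin=0.2, scale=30.0):
    """AAM-Softmax loss on cosine similarities between embeddings and
    class weight vectors; margin/scale are typical (assumed) values."""
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    idx = np.arange(len(labels))
    logits = scale * cos_sim.copy()
    # Add the angular margin only to each sample's target class.
    logits[idx, labels] = scale * np.cos(theta[idx, labels] + margin)
    # Numerically stable cross-entropy over the margin-adjusted logits.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()

def infonce_loss(anchor, positive, temperature=0.07):
    """InfoNCE: each anchor's positive is the same-index row of
    `positive`; the other rows in the batch act as negatives."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    sim = a @ p.T / temperature                       # (batch, batch)
    sim -= sim.max(axis=1, keepdims=True)
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

def joint_loss(cos_sim, labels, anchor, positive, weight=0.1):
    # Weighted sum of the classification and contrastive terms;
    # `weight` stands in for the paper's balancing hyperparameter.
    return aam_softmax_loss(cos_sim, labels) + weight * infonce_loss(anchor, positive)
```

The classification term pushes embeddings of different speakers apart by an angular margin, while the contrastive term pulls embeddings of the same speaker together, matching the two goals stated in the highlight.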
2. Related Works
2.1. Speaker Recognition System Framework
2.2. Baseline Network Model
2.2.1. SPD-TDNN Layer
2.2.2. SPD-TDNN Model
2.2.3. Other Models
2.3. Data Augmentation
2.3.1. Reverberation Enhancement
2.3.2. Noise Enhancement
2.3.3. SpecAugment
- Time masking: Randomly select a contiguous interval on the time axis and set it to zero, which is equivalent to blocking all of the sound signal in that interval. This operation simulates interruptions and missing information in the speech signal.
- Frequency masking: Randomly select a contiguous interval on the frequency axis and set it to zero, which is equivalent to masking the frequency information in that interval. This operation simulates noise and distortion in the speech signal.
- Frequency warping: Warp the spectrogram along the frequency axis so that some frequency intervals are stretched or compressed. This operation simulates intonation changes and accent differences in speech signals.
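The two masking operations above can be sketched as follows. This is a minimal NumPy illustration that applies a single mask per call; the maximum mask widths are illustrative assumptions (SpecAugment in practice applies several masks with tuned widths).

```python
import numpy as np

def time_mask(spec, max_width=10, rng=None):
    """Zero a random contiguous span of frames (time masking).
    `spec` has shape (freq_bins, time_frames)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, max(1, spec.shape[1] - width)))
    out[:, start:start + width] = 0.0
    return out

def freq_mask(spec, max_width=8, rng=None):
    """Zero a random contiguous band of frequency bins (frequency masking)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, max(1, spec.shape[0] - width)))
    out[start:start + width, :] = 0.0
    return out
```

Both functions leave the input untouched and return a masked copy, which is convenient when the same utterance is augmented several times per epoch.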
3. JLF-ISPD-TDNN Model
3.1. AAM-Softmax
3.2. InfoNCE
3.3. Joint Loss Function
4. Experiment
4.1. Experimental Setup
4.2. Experimental Results and Discussion
4.2.1. The Impact of the Loss Weight on Model Performance
4.2.2. Results and Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Layer | Structure | Configuration | Output |
|---|---|---|---|
| 1 | Conv1D + BN + ReLU | k = 5, p = 2 | 128 |
| 2 | SPD-TDNN | (128, 64, 1) | 192 |
|  | SPD-TDNN | (192, 64, 2) | 256 |
|  | SPD-TDNN | (256, 64, 3) | 320 |
|  | SPD-TDNN | (320, 64, 1) | 384 |
|  | SPD-TDNN | (384, 64, 2) | 448 |
|  | SPD-TDNN | (448, 64, 3) | 512 |
|  | Conv1D + BN + ReLU | k = 1, p = 0 | 256 |
|  | SPD-TDNN | (256, 64, 1) | 320 |
|  | SPD-TDNN | (320, 64, 2) | 384 |
|  | SPD-TDNN | (384, 64, 3) | 448 |
|  | SPD-TDNN | (448, 64, 1) | 512 |
|  | SPD-TDNN | (512, 64, 2) | 576 |
|  | SPD-TDNN | (576, 64, 3) | 640 |
|  | SPD-TDNN | (640, 64, 1) | 704 |
|  | SPD-TDNN | (704, 64, 2) | 768 |
|  | SPD-TDNN | (768, 64, 3) | 832 |
|  | SPD-TDNN | (832, 64, 1) | 896 |
|  | SPD-TDNN | (896, 64, 2) | 960 |
|  | SPD-TDNN | (960, 64, 3) | 1024 |
|  | Conv1D + BN + ReLU | k = 1, p = 0 | 512 |
| 3 | Statistics Pooling + BN |  | 1024 |
| 4 | FC + BN |  | 192 |
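The channel progression in the architecture table follows a dense-connectivity pattern: each SPD-TDNN block appends a fixed growth of 64 channels to its input, and the 1×1 Conv1D transitions compress the channel count before the next stage. A small sketch reproduces the progression under a growth rate of 64 (which gives 384, not 394, after the 320-channel block, since 320 + 64 = 384):

```python
# Each SPD-TDNN block concatenates `growth` new channels onto its input,
# as in densely connected TDNNs; 1x1 Conv1D layers reset the count.
GROWTH = 64

def spd_tdnn_out(in_channels, growth=GROWTH):
    return in_channels + growth

channels = 128                  # after the initial Conv1D + BN + ReLU
for _ in range(6):              # first dense stage: 128 -> 512
    channels = spd_tdnn_out(channels)
assert channels == 512

channels = 256                  # 1x1 transition compresses to 256
for _ in range(12):             # second dense stage: 256 -> 1024
    channels = spd_tdnn_out(channels)
assert channels == 1024         # matches the input to the final 1x1 conv
```

The final 1024-channel output is then compressed to 512 channels, and statistics pooling (mean and standard deviation) doubles it back to 1024 before the 192-dimensional embedding layer.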
| Model | Parameters (M) | EER (%) | minDCF |
|---|---|---|---|
| D-TDNN | 2.82 | 2.45 | 0.2617 |
| MFA-Conformer | 20.8 | 1.70 | 0.2033 |
| SE-ResNet | 23.6 | 1.51 | 0.1308 |
| SPD-TDNN | 3.11 | 1.23 | 0.1432 |
| ISPD-TDNN | 3.11 | 1.19 | 0.1272 |
| JLF-ISPD-TDNN | 3.11 | 1.02 | 0.1221 |
Share and Cite
Feng, T.; Fan, H.; Ge, F.; Cao, S.; Liang, C. Speaker Recognition Based on the Joint Loss Function. Electronics 2023, 12, 3447. https://doi.org/10.3390/electronics12163447