Enhancing Speaker Recognition Models with Noise-Resilient Feature Optimization Strategies
Abstract
1. Introduction
- We have thoroughly investigated a range of dimension reduction techniques and feature optimization methods specifically designed to tackle the complexities of high-dimensional data within speaker recognition systems.
- Overall, the research emphasizes the significance of feature optimization in speaker recognition systems and highlights the advantages of feature fusion, dimension reduction, and feature optimization techniques.
- The objective is to find the optimal combination of features that not only improves recognition accuracy but also reduces the dimensionality of the feature space, leading to faster computation.
- Our proposed models present robust solutions for improving speaker recognition performance in noisy environments across datasets of various sizes, accommodating different numbers of speakers. From small datasets with 120 speakers to medium ones with 630 speakers and large ones with 1251 speakers, our models demonstrate versatility, making them suitable for a broad range of applications and datasets with diverse scales and characteristics.
2. Related Work
3. Proposed Approach and Methodological Framework
3.1. Motivation
- Feature fusion methodology: Feature fusion, a method that amalgamates features from diverse sources or databases into a unified, enriched feature set, stands as a pivotal strategy in speaker recognition systems. Spectral features, which encapsulate frequency, power, and other signal characteristics such as MFCC, LPC, PLP, centroid, and entropy, provide a robust foundation. Conversely, prosodic features capture auditory properties such as stress, loudness variation, and intonation (e.g., RMS). While spectral features, particularly MFCC, have demonstrated superior performance compared to prosody-based systems (e.g., pitch, RMS), integrating both offers robustness vital for recognition systems. Additionally, feature derivatives enable the quantification of subtle changes in voice signals. By extracting six features and computing their first- and second-order derivatives (Δ and ΔΔ), we aim to bolster accuracy [23,24,25,26].
- Dimension reduction: This approach concentrates on reducing the dimensionality of the feature set, optimizing computational processes while retaining crucial information for precise speaker recognition. Principal component analysis (PCA) and independent component analysis (ICA) are utilized in our work for dimensionality reduction. However, it is important to note that feature combination is a complex process that can slow down computation. Thus, careful consideration of PCA and ICA trade-offs is essential for balancing computational efficiency with information retention [37,38,39,40].
- Feature optimization using genetic algorithms (GAs) and the marine predator algorithm (MPA): While dimension reduction accelerates computation, it does not inherently identify the optimal feature set. To address this, we employ feature optimization methods like genetic algorithms (GAs) and the marine predator algorithm (MPA). Feature optimization is vital for enhancing machine learning model performance by selecting the most relevant features, reducing overfitting, improving computational efficiency, and promoting model interpretability. The proposed approach is illustrated in Figure 1, with comprehensive explanations provided in Section 3.2, Section 3.3 and Section 3.4.
3.2. The Feature Fusion Approach (Approach 1)
3.2.1. Feature Extraction Techniques
The Mel-Frequency Cepstral Coefficient (MFCC)
Linear Predictive Coding (LPC)
Perceptual Linear Prediction (PLP)
Spectral Centroid (SC)
Spectral Entropy (SE)
- For a signal x(t), calculate s(f), the power spectral density.
- Calculate the power within the spectral band of interest; then normalize the band power so that it forms a probability distribution over the band.
- Calculate the spectral entropy utilizing Equation (7) [60]; a short code sketch of these steps follows this list.
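The following Python sketch mirrors these three steps. The function name, the Welch PSD estimator, and the band limits are our illustrative choices, not the paper's implementation.

```python
# Hedged sketch of spectral-entropy extraction; band and estimator are assumptions.
import numpy as np
from scipy.signal import welch

def spectral_entropy(x, fs, band=(0, 4000)):
    """Shannon entropy of the normalized PSD within a band of interest."""
    f, psd = welch(x, fs=fs)                   # s(f): power spectral density
    mask = (f >= band[0]) & (f <= band[1])     # spectral band of interest
    p = psd[mask] / np.sum(psd[mask])          # normalize band power to a PMF
    p = p[p > 0]                               # guard against log(0)
    return -np.sum(p * np.log2(p))             # entropy in the style of Equation (7)

# Example: entropy of 1 s of white noise sampled at 16 kHz
rng = np.random.default_rng(0)
print(spectral_entropy(rng.standard_normal(16000), fs=16000))
```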
Root Mean Square (RMS)
Delta Features
3.2.2. Feature Fusion Methodology
- 1. All 18 features are tested individually on the TIMIT white noise data with 630 speakers.
- 2. The two features with the highest SI accuracy and lowest average EER among the 18 are selected.
- 3. In determining the best model, the average accuracy and average EER across the three classifiers are considered. LPC and PLP emerge as the first- and second-best features, with average accuracies of 62.1% and 70.4%, respectively, surpassing the other features. Equation (11) gives the calculation of the average accuracy from the results of all three classifiers.
- 4. In the second stage, two features are fused by individually combining the best features, LPC and PLP, with the remaining 17 features. Once again, the top two models are selected from this process; these are the MFCC + LPC fusion model and the PLP + LPC fusion model.
- 5. In the third stage, three features are fused by individually combining the remaining 16 features with the two best models selected in the second stage.
- 6. The fusion of 4 to 18 features proceeds in the same manner, with the two best models chosen at each step. In total, 315 models are tested for the TIMIT white noise data. Figure 4 shows the workflow and methodology for feature aggregation using the TIMIT white noise 630-speaker database. A schematic sketch of this stage-wise search follows.
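Reading the steps above as a beam search of width two over feature subsets, a minimal sketch looks like the following; `evaluate` is a dummy stand-in for the Equation (11) average accuracy over the three classifiers, and the feature names are our shorthand.

```python
# Schematic stage-wise fusion search (beam width 2); not the paper's code.
FEATURES = ["MFCC", "dMFCC", "ddMFCC", "LPC", "dLPC", "ddLPC",
            "PLP", "dPLP", "ddPLP", "SC", "dSC", "ddSC",
            "RMS", "dRMS", "ddRMS", "SE", "dSE", "ddSE"]  # 18 base features

def evaluate(model):
    """Placeholder for Equation (11): average SI accuracy of the LD, KNN,
    and ensemble classifiers trained on the fused feature set."""
    return sum(len(name) for name in model) % 7  # deterministic dummy score

def fusion_search(features, beam_width=2):
    tested = len(features)                            # stage 1: single features
    beam = sorted((frozenset([f]) for f in features),
                  key=evaluate, reverse=True)[:beam_width]
    for _ in range(2, len(features) + 1):             # stages 2..18
        candidates = {m | {f} for m in beam for f in features if f not in m}
        tested += len(candidates)
        beam = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return beam, tested

best_models, n_tested = fusion_search(FEATURES)
print(n_tested)  # on the order of the 315 models the paper reports
```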
Model Optimization
- TIMIT white noise data with 120 speakers,
- TIMIT babble noise data with 120 speakers,
- TIMIT babble noise data with 630 speakers, and
- VoxCeleb1 dataset.
3.3. Dimension Reduction Techniques (Approach 2)
3.3.1. Principal Component Analysis (PCA)
- Loading the input data: The feature fusion model, which serves as the input dataset, is loaded into the PCA algorithm.
- Subtracting the mean: The mean of the data is subtracted from each feature in the original dataset. This step ensures that the data are centered around the origin.
- Calculating the covariance matrix: The covariance matrix of the dataset is computed. This matrix captures the relationships and variations among the different features.
- Determining the eigenvectors: The eigenvectors associated with the largest eigenvalues of the covariance matrix are identified. These eigenvectors represent the directions of maximum variance in the dataset.
- Projecting the dataset: The original dataset is projected onto the eigenvectors obtained in the previous step. This projection transforms the data into a lower-dimensional subspace spanned by the eigenvectors (see the sketch after this list).
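A minimal NumPy rendering of these five steps; the matrix shape and variable names are ours, standing in for the 126-dimensional fused feature model.

```python
# PCA via covariance eigendecomposition, following the steps listed above.
import numpy as np

def pca_reduce(X, n_components):
    Xc = X - X.mean(axis=0)                 # subtract the mean (center the data)
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    W = eigvecs[:, order[:n_components]]    # directions of maximum variance
    return Xc @ W                           # project onto the reduced subspace

X = np.random.default_rng(0).standard_normal((1200, 126))  # stand-in fused features
print(pca_reduce(X, n_components=100).shape)               # (1200, 100)
```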
3.3.2. Independent Component Analysis (ICA)
- Preprocessing: Similar to PCA, the data are typically preprocessed by centering and scaling the features to ensure a common reference point and an equal contribution from each feature.
- Whitening: Whitening is performed to transform the data into a new representation where the features are uncorrelated and have unit variances. This step helps to remove any linear dependencies between the features.
- Defining the non-gaussianity measure: ICA aims to find components that are as statistically independent as possible. Different non-gaussianity measures can be used, such as kurtosis or negentropy, to quantify the departure from gaussianity and guide the separation of independent components.
- Optimization: The main objective of ICA is to maximize the non-gaussianity measure for each component. This is achieved through an optimization process, which involves finding the weights or mixing matrix that maximizes the non-gaussianity measure.
- Iterative estimation: ICA often involves an iterative estimation process to refine the separation of independent components; a brief sketch follows.
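As one possible realization, scikit-learn's FastICA covers the whitening, non-gaussianity, optimization, and iterative estimation steps in a single estimator; the hyperparameters below are illustrative, not the paper's settings.

```python
# FastICA sketch: whitening + iterative maximization of non-gaussianity.
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.default_rng(0).standard_normal((1200, 126))  # stand-in fused features

ica = FastICA(
    n_components=100,         # reduced dimensionality (illustrative)
    whiten="unit-variance",   # decorrelate features with unit variance
    fun="logcosh",            # negentropy-style non-gaussianity measure
    max_iter=500,             # iterative estimation of the unmixing matrix
    random_state=0,
)
S = ica.fit_transform(X)      # estimated independent components
print(S.shape)                # (1200, 100)
```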
3.3.3. Model Optimization Using Dimension Reduction Techniques
- Then, we reduced the feature vectors to randomly chosen sizes between 50% and 90% of their original dimensionality using PCA and ICA.
- This yields three reduced feature models each for PCA and ICA, in addition to the full 126-feature PCA and 126-feature ICA models.
- To evaluate the performance of the reduced dimension models, we employed LD, KNN, and ensemble classifiers.
- Accuracy, EER, and computation time are calculated for each reduced model; a sketch of this sweep appears below. Figure 8 explains the steps involved in model optimization using the dimension reduction technique for each dataset used.
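A sketch of this sweep, assuming PCA reduction levels of 50-90%, a KNN classifier, and synthetic stand-in data with the 120-speaker, 126-feature shape used elsewhere in the paper:

```python
# Reduce the fused model to 50-90% of its size, then time and score a classifier.
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((1200, 126))             # stand-in for fused features
y = rng.integers(0, 120, size=1200)              # toy labels for 120 speakers
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

for keep in (0.5, 0.6, 0.7, 0.8, 0.9):           # 50% ... 90% of original size
    k = int(keep * X.shape[1])
    pca = PCA(n_components=k).fit(X_tr)
    t0 = time.perf_counter()
    clf = KNeighborsClassifier().fit(pca.transform(X_tr), y_tr)
    acc = clf.score(pca.transform(X_te), y_te)
    print(f"{k:3d} dims: accuracy={acc:.3f}, time={time.perf_counter() - t0:.2f}s")
```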
3.4. Feature Optimization (Approach 3)
3.4.1. Genetic Algorithms (GAs) [3,4]
- Initialization: Generate an initial population of solutions.
- Evaluation: Assess each solution’s fitness using a defined function.
- Selection: Choose individuals from the population based on their fitness.
- Crossover: Combine selected individuals to create offspring.
- Mutation: Introduce random changes to offspring.
- Replacement: Create the next generation by combining parents and offspring.
- Termination: Stop the algorithm when a termination condition is met (a minimal sketch follows this list).
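A compact sketch of these seven steps for binary feature masks; the fitness function is a placeholder (the paper's fitness is SI accuracy/EER on the reduced feature set), and all hyperparameters are illustrative.

```python
# Minimal GA for feature selection over 126-dimensional binary masks.
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, POP, GENERATIONS = 126, 20, 30

def fitness(mask):
    k = int(mask.sum())
    return 0.0 if k == 0 else 1.0 / k            # dummy stand-in for accuracy

pop = rng.integers(0, 2, size=(POP, N_FEATURES), dtype=bool)    # initialization
for _ in range(GENERATIONS):
    scores = np.array([fitness(ind) for ind in pop])            # evaluation
    parents = pop[np.argsort(scores)[-POP // 2:]]               # selection
    cuts = rng.integers(1, N_FEATURES, size=POP // 2)           # crossover points
    kids = np.array([np.concatenate([parents[i % len(parents)][:c],
                                     parents[(i + 1) % len(parents)][c:]])
                     for i, c in enumerate(cuts)])
    kids ^= rng.random(kids.shape) < 0.01                       # mutation
    pop = np.vstack([parents, kids])                            # replacement
best = pop[np.argmax([fitness(ind) for ind in pop])]            # termination
print("selected features:", int(best.sum()), "of", N_FEATURES)
```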
3.4.2. The Marine Predator Algorithm (MPA) [5,6]
- Phase 1: The predator moves at a slower pace than the prey, characterized by a high velocity ratio.
- Phase 2: The predator and prey maintain nearly identical speeds, representing a unity velocity ratio.
- Phase 3: The predator accelerates and moves faster than the prey, indicating a low velocity ratio.
- Initialization: Start with a population of marine predators.
- Prey Location: Determine the location of potential prey.
- Predation: Update predator positions towards the prey.
- Encounter: Check if predators have caught the prey.
- Feeding: If caught, adjust predator positions accordingly.
- Behavior Update: Modify predator behavior based on success.
- Termination: Decide when to stop the algorithm.
- Iteration: Repeat steps 2–7 until the termination criteria are met; a simplified sketch follows.
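A deliberately compressed, schematic rendering of the three-phase search for feature selection; the Brownian/Lévy step details and the FADs effect of the full MPA are simplified away, and the fitness function is again a placeholder.

```python
# Schematic MPA: positions in [0,1]^126 thresholded at 0.5 give a feature mask.
import numpy as np

rng = np.random.default_rng(0)
DIM, POP, ITERS = 126, 20, 60

def fitness(pos):
    """Placeholder objective; the paper uses SI accuracy / EER instead."""
    return -abs(int((pos > 0.5).sum()) - 100)    # toy target: ~100 features

prey = rng.random((POP, DIM))                                    # initialization
elite = prey[np.argmax([fitness(p) for p in prey])].copy()       # top predator
for t in range(ITERS):
    if t < ITERS / 3:                            # phase 1: high velocity ratio
        step = rng.normal(size=(POP, DIM))       # Brownian-style motion
        prey += 0.5 * step * (elite - step * prey)
    elif t < 2 * ITERS / 3:                      # phase 2: unity velocity ratio
        step = rng.standard_cauchy((POP, DIM)) * 0.05   # heavy-tailed Levy proxy
        prey += 0.5 * step * (elite - step * prey)
    else:                                        # phase 3: low velocity ratio
        step = rng.standard_cauchy((POP, DIM)) * 0.05
        prey = elite + 0.5 * step * (elite - prey)
    prey = np.clip(prey, 0, 1)                   # keep within the search space
    best = prey[np.argmax([fitness(p) for p in prey])]
    if fitness(best) > fitness(elite):           # elite (memory) update
        elite = best.copy()
print("features selected:", int((elite > 0.5).sum()))
```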
3.4.3. Model Optimization Using the Feature Selection Approach
3.5. Classification
3.5.1. Linear Discriminant (LD) Classifier
- The function f(x) is the estimated probability that x belongs to a particular class and employs a gaussian distribution function. Here, n denotes the number of instances, and K is the number of classes.
- By substituting the gaussian distribution into the equation and simplifying, we obtain Equation (14). This function serves as a discriminant, and the class with the highest score is the output classification (y); a standard form is written out below.
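For reference, a standard form of this discriminant under a shared-variance gaussian assumption, which we take to correspond to Equation (14); in our notation, mu_k is the class mean, sigma^2 the pooled variance, and P(k) = n_k / n the prior of class k:

```latex
\delta_k(x) = x \, \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln P(k),
\qquad
y = \arg\max_{k \in \{1,\dots,K\}} \delta_k(x)
```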
3.5.2. K Nearest Neighbor Classification (KNN)
- Choose a value for K, which represents the number of neighbors.
- Compute the Euclidean distance between the unknown data point and its K nearest neighbors.
- Classify the K nearest neighbors based on the computed Euclidean distances.
- Count the number of data points in each class.
- Assign the new data point to the class with the highest count (a short sketch follows).
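These five steps translate directly into a few lines of NumPy; the toy data and function name are ours.

```python
# Plain KNN classification following the steps above (unoptimized, for clarity).
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):               # 1. choose K
    d = np.linalg.norm(X_train - x_new, axis=1)              # 2. Euclidean distances
    neighbor_labels = y_train[np.argsort(d)[:k]]             # 3. K nearest labels
    classes, counts = np.unique(neighbor_labels, return_counts=True)  # 4. count
    return classes[np.argmax(counts)]                        # 5. majority class

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 126)), rng.integers(0, 5, size=100)
print(knn_predict(X, y, rng.standard_normal(126), k=5))
```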
3.5.3. Ensemble Classification
- Bootstrap sampling: Create multiple bootstrap samples by randomly sampling with replacement from the original dataset.
- Base learner training: Train a base decision-tree classifier on each bootstrap sample independently.
- Voting or averaging: Combine the predictions of all base classifiers using majority voting, as sketched below.
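A sketch of this bagged-tree ensemble using scikit-learn (version 1.2 or later for the `estimator` argument); the hyperparameters and toy data are illustrative, not the paper's settings.

```python
# Bagging: bootstrap sampling + independent decision trees + majority voting.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1200, 126)), rng.integers(0, 120, size=1200)

ensemble = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner: a decision tree
    n_estimators=50,                     # 50 bootstrap samples / trees
    bootstrap=True,                      # sampling with replacement
    random_state=0,
).fit(X, y)                              # predict() applies majority voting
print(ensemble.predict(X[:3]))
```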
4. Evaluation
4.1. Database Preparation
- In this research work, we incorporated the noisy TIMIT speech dataset developed by the Florida Institute of Technology, which consists of approximately 322 h of speech from the TIMIT acoustic-phonetic continuous speech corpus (LDC93S1). The dataset was modified by adding different levels of additive noise while keeping the original TIMIT arrangement intact. For our study, we specifically focused on TIMIT white noise and babble noise with a 30 dB noise level. We selected subsets of the dataset containing 120 speakers for TIMIT babble and white noise and 630 speakers for TIMIT white and babble noise. Each speaker contributed a total of 10 utterances. For TIMIT babble and white noise with 120 speakers, we used 720 voice samples for training and 480 voice samples for testing, resulting in a total of 1200 voices.
- Similarly, for TIMIT babble and white noise with 630 speakers, we used 5040 voice samples for training and 1260 voice samples for testing, totaling 6300 voices [72]. This approach allowed us to make fair comparisons with other studies, including [41,42]. In the context of the TIMIT dataset or any similar speech dataset, when referring to a specific SNR level such as “30 dB”, it typically represents the ratio of the signal power to the noise power on average. Therefore, it refers more to the mean noise level rather than the peak noise level.
- The VoxCeleb1 dataset is known for its large size, containing over 100,000 voice samples. The videos in this database were recorded in diverse and challenging multispeaker environments, such as outdoor stadiums, where real-world noise such as laughter, overlapping speech, and room acoustics degrades the recordings. For our research, we utilized data from 1251 speakers and a total of 153,516 speaker voices. To ensure a fair comparison with [47,48], we carefully selected 148,642 utterances for training and 4874 utterances for testing in the context of speaker verification tasks. For speaker identification, we utilized 145,265 utterances for training and 8251 utterances for testing.
- The TIMIT and VoxCeleb1 voice datasets consist of full English sentences, making them suitable for analyzing speech at the sentence level. The dataset details, along with the number of utterances used for training and testing, are shown in Table 2 for reference.
- We divided our data using the same method as utilized by other researchers to ensure a fair comparison. Specifically, for the TIMIT dataset, we followed the data split used by [46] for 630 speakers and [42,43] for 120 speakers. Similarly, for the VoxCeleb1 dataset, we employed the same data split as described in [47,48,49,50] to ensure consistency and fairness in our comparisons. This approach allowed us to conduct meaningful evaluations while maintaining parity with existing studies.
4.2. Assessing the Effectiveness of Speaker Identification (SI)
4.3. Assessing the Effectiveness of Speaker Verification (SV)
5. Results Discussion
5.1. Optimal Outcomes Utilizing Feature-Level Fusion (Approach 1)
5.2. Optimal Outcomes Utilizing the Dimension Reduction Technique (Approach 2)
5.3. Optimal Results Achieved with the Feature Optimization Technique (Approach 3)
5.4. A Performance Comparison across Feature-Level Fusion, Dimension Reduction, and Feature Optimization Indicates their Overall Effectiveness
5.5. Comparative Analysis of Computational Timing: Feature-Level Fusion, Dimension Reduction, and Feature Optimization Techniques
5.6. Comparing the Proposed Work with the Existing Approach
5.6.1. Result Comparison for TIMIT Babble Noise (120 Speakers)
5.6.2. Result Comparison for TIMIT Babble Noise (630 Speakers)
5.6.3. Result Comparison for TIMIT White Noise (120 Speakers)
5.6.4. Result Comparison for TIMIT White Noise (630 Speakers)
5.6.5. Result Comparison for VoxCeleb1 Data (Largest Dataset)
5.7. System Configuration
5.8. The SR Performance Is Influenced by Several Factors, as Observed in This Study
- Feature Fusion: The fusion of more features does not always lead to better SR performance. In some cases, models with smaller numbers of fused features outperform those with more features. This suggests that the careful selection and combination of features are crucial for optimal results.
- Feature Optimization: Among the three proposed approaches, feature optimization with PCA-GA and PCA-MPA delivers the best results in most cases. Notably, it significantly reduces computation time, making it a promising technique for improving efficiency.
- Impact on Classification: The choice of approach affects the performance of different classifiers. KNN classification benefits from dimension reduction and feature optimization, while LD and ensemble classifiers perform better with feature-level fusion.
- Dataset Influence: The input dataset plays a significant role in SR performance. For TIMIT babble noise data, feature-level fusion and PCA-GA feature optimization demonstrate superior results, while TIMIT white noise data benefit from PCA dimension reduction and PCA-MPA feature optimization. PCA-MPA also performs well on the VoxCeleb1 dataset.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chauhan, N.; Isshiki, T.; Li, D. Text-independent speaker recognition system using feature-level fusion for audio databases of various sizes. SN Comput. Sci. 2023, 4, 531. [Google Scholar] [CrossRef]
- Lu, X.; Dang, J. Dimension reduction for speaker identification based on mutual information. In Proceedings of the Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 27–31 August 2007; pp. 2021–2024. [Google Scholar]
- Zamalloa, M.; Bordel, G.; Rodriguez, L.; Penagarikano, M. Feature selection based on genetic algorithms for speaker recognition. In Proceedings of the 2006 IEEE Odyssey—The Speaker and Language Recognition Workshop, San Juan, PR, USA, 28–30 June 2006; pp. 1–8. [Google Scholar]
- Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley: Reading, MA, USA, 1989. [Google Scholar]
- Rai, R.; Dhal, K.G.; Das, A.; Ray, S. An inclusive survey on marine predators algorithm: Variants and applications. Arch. Comput. Methods Eng. 2023, 30, 3133–3172. [Google Scholar] [CrossRef] [PubMed]
- Elminaam, D.S.A.; Nabil, A.; Ibraheem, S.A.; Houssein, E.H. An efficient marine predators algorithm for feature selection. IEEE Access. 2021, 9, 60136–60153. [Google Scholar] [CrossRef]
- Yu, D.; Deng, L. Automatic Speech Recognition: A Deep Learning Approach; Springer: London, UK, 2015. [Google Scholar]
- Omar, N.M.; El-Hawary, M.E. Feature fusion techniques based training MLP for speaker identification system. In Proceedings of the 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), Windsor, ON, Canada, 30 April–3 May 2017; pp. 1–6. [Google Scholar]
- Jin, Y.; Song, P.; Zheng, W.; Zhao, L. A feature selection and feature fusion combination method for speaker-independent speech emotion recognition. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4808–4812. [Google Scholar]
- Tu, Y.-H.; Du, J.; Wang, Q.; Bao, X.; Dai, L.-R.; Lee, C.-H. An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech. Comput. Speech Lang. 2017, 46, 517–534. [Google Scholar] [CrossRef]
- Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40. [Google Scholar] [CrossRef]
- Ahmed, A.I.; Chiverton, J.P.; Ndzi, D.L.; Becerra, V.M. Speaker recognition using PCA-based feature transformation. Speech Commun. 2019, 110, 33–46. [Google Scholar] [CrossRef]
- Kumari, T.R.J.; Jayanna, H.S. Limited data speaker verification: Fusion of features. Int. J. Electr. Comput. Eng. 2017, 7, 3344–3357. [Google Scholar] [CrossRef]
- Furui, S. Comparison of speaker recognition methods using statistical features and dynamic features. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 342–350. [Google Scholar] [CrossRef]
- Kermorvant, C.; Morris, A. A comparison of two strategies for ASR in additive noise: Missing data and spectral subtraction. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech 1999), Budapest, Hungary, 5–9 September 1999; pp. 2841–2844. [Google Scholar]
- Varga, A.P.; Moore, R.K. Hidden Markov model decomposition of speech and noise. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, 3–6 April 1990; pp. 845–848. [Google Scholar]
- Mittal, U.; Phamdo, N. Signal/noise KLT based approach for enhancing speech degraded by colored noise. IEEE Trans. Speech Audio Process. 2000, 8, 159–167. [Google Scholar] [CrossRef]
- Hu, Y.; Loizou, P.C. Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun. 2007, 49, 588–601. [Google Scholar] [CrossRef]
- Vaseghi, S.V.; Milner, B.P. Noise compensation methods for hidden Markov model speech recognition in adverse environments. IEEE Trans. Speech Audio Process. 1997, 5, 11–21. [Google Scholar] [CrossRef]
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
- Hermansky, H.; Morgan, N. RASTA processing of speech. IEEE Trans. Speech Audio Process. 1994, 2, 578–589. [Google Scholar] [CrossRef]
- Hermansky, H.; Morgan, N.; Bayya, A.; Kohn, P. Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP). In Proceedings of the 2nd European Conference on Speech Communication and Technology (Eurospeech 1991), Genova, Italy, 24–26 September 1991; pp. 1367–1370. [Google Scholar]
- Adami, A.G.; Mihaescu, R.; Reynolds, D.A.; Godfrey, J.J. Modeling prosodic dynamics for speaker recognition. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, 6–10 April 2003; pp. IV–788. [Google Scholar]
- Kumar, K.; Kim, C.; Stern, R.M. Delta-spectral cepstral coefficients for robust speech recognition. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 4784–4787. [Google Scholar]
- Sönmez, K.; Shriberg, E.; Heck, L.; Weintraub, M. Modeling dynamic prosodic variation for speaker verification. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 1998), Sydney, Australia, 30 November–4 December 1998; pp. 3189–3192. [Google Scholar]
- Carey, M.J.; Parris, E.S.; Lloyd-Thomas, H.; Bennett, S. Robust prosodic features for speaker identification. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP '96, Philadelphia, PA, USA, 3–6 October 1996; pp. 1800–1803. [Google Scholar]
- Chauhan, N.; Isshiki, T.; Li, D. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database. In Proceedings of the 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), Singapore, 23–25 February 2019; pp. 130–133. [Google Scholar]
- Lip, C.C.; Ramli, D.A. Comparative study on feature, score and decision level fusion schemes for robust multibiometric systems. In Frontiers in Computer Education; Sambath, S., Zhu, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 941–948. [Google Scholar]
- Alam, M.J.; Kenny, P.; Stafylakis, T. Combining amplitude and phase-based features for speaker verification with short duration utterances. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 249–253. [Google Scholar]
- Li, Z.-Y.; He, L.; Zhang, W.-Q.; Liu, J. Multi-feature combination for speaker recognition. In Proceedings of the 2010 7th International Symposium on Chinese Spoken Language Processing, Tainan, Taiwan, 29 November–3 December 2010; pp. 318–321. [Google Scholar]
- Neustein, A.; Patil, H.A. Forensic Speaker Recognition; Springer: New York, NY, USA, 2012. [Google Scholar]
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
- Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
- Roweis, S.T. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; pp. 626–632. [Google Scholar]
- Bailey, S. Principal component analysis with noisy and/or missing data. Publ. Astron. Soc. Pac. 2012, 124, 1015–1023. [Google Scholar] [CrossRef]
- Delchambre, L. Weighted principal component analysis: A weighted covariance eigendecomposition approach. Mon. Not. R. Astron. Soc. 2014, 446, 3545–3555. [Google Scholar] [CrossRef]
- Ding, P.; Kang, X.; Zhang, L. Personal recognition using ICA. In Proceedings of the ICONIP2001, Shanghai, China, 15–18 November 2001. [Google Scholar]
- Rosca, J.; Kopfmehl, A. Cepstrum-like ICA representations for text independent speaker recognition. In Proceedings of the ICA’2003, Nara, Japan, 1–4 April 2003; pp. 999–1004. [Google Scholar]
- Cichocki, A.; Amari, S.I. Adaptive Blind Signal and Image Processing; John Wiley: Chichester, UK, 2002. [Google Scholar]
- Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis; John Wiley & Sons: New York, NY, USA, 2001. [Google Scholar]
- Loughran, R.; Agapitos, A.; Kattan, A.; Brabazon, A.; O’Neill, M. Feature selection for speaker verification using genetic programming. Evol. Intell. 2017, 10, 1–21. [Google Scholar] [CrossRef]
- Al-Kaltakchi, M.T.S.; Woo, W.L.; Dlay, S.; Chambers, J.A. Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects. EURASIP J. Adv. Signal Process. 2017, 2017, 1–17. [Google Scholar] [CrossRef]
- Al-Kaltakchi, M.T.S.; Woo, W.L.; Dlay, S.; Chambers, J.A. Comparison of I-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 533–537. [Google Scholar]
- Zou, X.; Jancovic, P.; Kokuer, M. The effectiveness of ICA-based representation: Application to speech feature extraction for noise robust speaker recognition. In Proceedings of the European Signal Processing Conference (EUSIPCO), Florence, Italy, 4–8 September 2006; pp. 1–5. [Google Scholar]
- Mohammadi, M.; Mohammadi, H.R.S. Study of speech features robustness for speaker verification application in noisy environments. In Proceedings of the 2016 8th International Symposium on Telecommunications (IST), Tehran, Iran, 27–28 September 2016; pp. 489–493. [Google Scholar]
- Meriem, F.; Farid, H.; Messaoud, B.; Abderrahmene, A. Robust speaker verification using a new front end based on multitaper and gammatone filters. In Proceedings of the 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech, Morocco, 23–27 November 2014; pp. 99–103. [Google Scholar]
- Okabe, K.; Koshinaka, T.; Shinoda, K. Attentive statistics pooling for deep speaker embedding. arXiv 2018, arXiv:1803.10963. [Google Scholar]
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A large-scale speaker identification dataset. arXiv 2017, arXiv:1706.08612. [Google Scholar]
- Mandalapu, H.; Ramachandra, R.; Busch, C. Multilingual voice impersonation dataset and evaluation. In Communications in Computer and Information Science; Yayilgan, S.Y., Bajwa, I.S., Sanfilippo, F., Eds.; Springer: Cham, Switzerland, 2021; pp. 179–188. [Google Scholar]
- Cai, W.; Chen, J.; Li, M. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. arXiv 2018, arXiv:1804.05160. [Google Scholar]
- Lartillot, O.; Toiviainen, P. MIR in Matlab (II): A toolbox for musical feature extraction from audio. In Proceedings of the 10th International Conference on Digital Audio Effects, Bordeaux, France, 10–15 September 2007; pp. 127–130. [Google Scholar]
- Chauhan, N.; Isshiki, T.; Li, D. Speaker recognition using fusion of features with feedforward artificial neural network and support vector machine. In Proceedings of the 2020 International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 17–19 June 2020; pp. 170–176. [Google Scholar]
- Chakroborty, S.; Roy, A.; Saha, G. Fusion of a complementary feature set with MFCC for improved closed set text-independent speaker identification. In Proceedings of the 2006 IEEE International Conference on Industrial Technology, Mumbai, India, 15–17 December 2006; pp. 387–390. [Google Scholar]
- Ahmad, K.S.; Thosar, A.S.; Nirmal, J.H.; Pande, V.S. A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network. In Proceedings of the 2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, India, 4–7 January 2015; pp. 1–6. [Google Scholar]
- Slifka, J.; Anderson, T.R. Speaker modification with LPC pole analysis. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; pp. 644–647. [Google Scholar]
- Wang, L.; Chen, Z.; Yin, F. A novel hierarchical decomposition vector quantization method for high-order LPC parameters. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 23, 212–221. [Google Scholar] [CrossRef]
- Ellis, D.P.W. PLP, RASTA, MFCC and Inversion in Matlab. 2005. Available online: http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/ (accessed on 15 January 2020).
- Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990, 87, 1738–1752. [Google Scholar] [CrossRef] [PubMed]
- Chauhan, N.; Chandra, M. Speaker recognition and verification using artificial neural network. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 22–24 March 2017; pp. 1147–1149. [Google Scholar]
- Ross, A. Fusion, feature-level. In Encyclopedia of Biometrics; Li, S.Z., Jain, A., Eds.; Springer: Boston, MA, USA, 2009; pp. 597–602. [Google Scholar]
- Root-mean-square Value. A Dictionary of Physics, 6th ed.; Oxford University Press: Oxford, UK, 2009. [Google Scholar]
- You, S.D.; Hung, M.-J. Comparative study of dimensionality reduction techniques for spectral–temporal data. Information 2021, 12, 1. [Google Scholar] [CrossRef]
- Vidhya, A. Understanding Principle Component Analysis (PCA) Step by Step. 2020. Available online: https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9 (accessed on 15 March 2020).
- Herault, J.; Jutten, C.; Ans, B. Detection de grandeurs primitives dans un message composite par une architecture de calcul neuromimetique en apprentissage non supervise. In Proceedings of the GRETSI, Nice, France, 20–24 May 1985; p. 536. [Google Scholar]
- Tharwat, A. Independent component analysis: An introduction. Appl. Comput. Inform. 2018, 17, 222–249. [Google Scholar] [CrossRef]
- Zhao, Y.; Sun, P.-P.; Tan, F.-L.; Hou, X.; Zhu, C.-Z. NIRS-ICA: A MATLAB toolbox for independent component analysis applied in fNIRS studies. Front. Neurosci. 2021, 15, 683735. [Google Scholar] [CrossRef] [PubMed]
- Wang, A.; An, N.; Chen, G.; Li, L.; Alterovitz, G. Accelerating wrapper-based feature selection with K-nearest-neighbor. Knowl. Based Syst. 2015, 83, 81–91. [Google Scholar] [CrossRef]
- Subasi, A. Machine learning techniques. In Practical Machine Learning for Data Analysis Using Python; Elsevier: Amsterdam, The Netherlands, 2020; pp. 91–202. [Google Scholar]
- Yao, Z.; Ruzzo, W.L. A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. BMC Bioinform. 2006, 7, S11. [Google Scholar] [CrossRef]
- Dietterich, T.G. Ensemble learning. In The Handbook of Brain Theory and Neural Networks; Arbib, M.A., Ed.; MIT Press: Cambridge, MA, USA, 2012; pp. 110–125. [Google Scholar]
- Kam, H.T. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar] [CrossRef]
- Abdulaziz, A.; Kepuska, V. Noisy TIMIT speech LDC2017S04. In Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 2017. [Google Scholar]
- Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
- Tharwat, A. Classification assessment methods: A detailed tutorial. Appl. Comput. Inform. 2020, 17, 168–192. [Google Scholar] [CrossRef]
Feature | Number of Feature Vectors |
---|---|
MFCC | 13 |
ΔMFCC | 13 |
ΔΔMFCC | 13 |
LPC | 13 |
ΔLPC | 13 |
ΔΔLPC | 13 |
PLP | 13 |
ΔPLP | 13 |
ΔΔPLP | 13 |
Centroid | 1 |
ΔCentroid | 1 |
ΔΔCentroid | 1 |
RMS | 1 |
ΔRMS | 1 |
ΔΔRMS | 1
Entropy | 1 |
ΔEntropy | 1 |
ΔΔEntropy | 1 |
Total feature vectors | 126 |
Information | VoxCeleb1 for SI | VoxCeleb1 for SV | TIMIT Babble Noise (630 Speakers) | TIMIT Babble Noise (120 Speakers) | TIMIT White Noise (630 Speakers) | TIMIT White Noise (120 Speakers) |
---|---|---|---|---|---|---|
Total number of speakers | 1251 | 1251 | 630 | 120 | 630 | 120 |
Recordings per speaker | Varies | Varies | 10 | 10 | 10 | 10 |
Total utterances for training | 145,265 | 148,642 | 5040 | 720 | 5040 | 720 |
Total utterances for testing | 8251 | 4874 | 1260 | 480 | 1260 | 480 |
Total number of audio recordings | 153,516 | 153,516 | 6300 | 1200 | 6300 | 1200 |
Source | Open | Open | Linguistic Data Consortium | Linguistic Data Consortium | Linguistic Data Consortium | Linguistic Data Consortium |
Language | English | English | English | English | English | English |
Environment | Multimedia | Multimedia | Noisy | Noisy | Noisy | Noisy |
Features Used (Model) | Classifier Model | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|
MFCC | LD | 13 | 3.3 | 0.5 | 64.6 | 9.2 |
LPC | LD | 13 | 3.5 | 0.7 | 76.2 | 8.7 |
MFCC + LPC | LD | 26 | 3.8 | 0.72 | 86.7 | 6.4 |
LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔLPC + ΔΔPLP (12 features) | LD | 96 | 4.9 | 0.8 | 92.7 | 4.4 |
Features Used (Model) | Classifier Model | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|
MFCC | LD | 13 | 5.4 | 0.5 | 47 | 12.7 |
LPC | LD | 13 | 5.5 | 0.8 | 56 | 9.4 |
MFCC + LPC | LD | 26 | 6.2 | 0.9 | 68.5 | 7.1 |
LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔLPC + ΔΔPLP + ΔRMS + ΔΔLPC (14 features) | LD | 110 | 8.9 | 0.9 | 89.3 | 2.2 |
Features Used (Model) | Classifier Model | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|
MFCC | LD | 13 | 3.2 | 0.5 | 60 | 15.7 |
LPC | LD | 13 | 3.5 | 0.7 | 61.1 | 11.4 |
MFCC + LPC | LD | 26 | 3.6 | 0.7 | 84 | 7.5 |
LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔΔPLP (11 features) | LD | 83 | 4.8 | 1.2 | 93.3 | 1.1 |
Features Used (Model) | Classifier Model | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|
MFCC | LD | 13 | 5.3 | 0.6 | 41 | 16.9 |
LPC | LD | 13 | 5.8 | 0.7 | 40 | 18.4 |
MFCC + LPC | LD | 26 | 6.9 | 0.7 | 59.2 | 11.2 |
LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔLPC + ΔΔPLP (12 features) | LD | 96 | 14.9 | 3.2 | 79.4 | 2.4 |
Features Used (Model) | Classifier Model | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|
MFCC | KNN | 13 | 1071.6 | 38.9 | 58.1 | 44 |
LPC | KNN | 13 | 1215.1 | 41 | 59.6 | 21.2 |
MFCC + LPC | KNN | 26 | 1281 | 48.1 | 77.6 | 11.7 |
LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔLPC + ΔΔPLP + ΔRMS + ΔΔLPC (14 features) | KNN | 110 | 1458.6 | 48.79 | 90 | 4.07 |
Classifier | Total Number of Speakers | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|
LD | 120 | 5.8 | 0.8 | 89.8 | 1.09 |
KNN | 120 | 2.24 | 0.9 | 79.8 | 0.77 |
Ensemble | 120 | 5.7 | 1.6 | 85.8 | 30 |
LD | 630 | 10.9 | 4.7 | 89.9 | 1.1 |
KNN | 630 | 2.8 | 2.9 | 82.9 | 0.14 |
Ensemble | 630 | 134 | 11.3 | 81.3 | 1.02 |
Classifier | Total Number of Speakers | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|
LD | 120 | 8.5 | 1.7 | 86.9 | 0.9 |
KNN | 120 | 9.9 | 1.6 | 79.4 | 1.2 |
Ensemble | 120 | 9.9 | 2.3 | 84.8 | 1.5 |
LD | 630 | 22.5 | 4.7 | 79.2 | 3 |
KNN | 630 | 4.3 | 3.2 | 73.4 | 0.16 |
Ensemble | 630 | 181.7 | 10.9 | 73.1 | 4.2 |
Classifier | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|
LD | 2206 | 28.9 | 70.9 | 15.3 |
KNN | 2090.9 | 50.9 | 89.7 | 4.5 |
Ensemble | 11,108 | 256.8 | 63.7 | 31.2 |
Method | Classifier | Feature Used | Database | Number of Speaker | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|---|---|---|
PCA | LD | All 18 | TIMIT babble noise | 120 | 126 | 2.5 | 0.7 | 89.9 | 0.9 |
PCA | KNN | All 18 | TIMIT babble noise | 630 | 80 | 2.7 | 0.9 | 90.6 | 0.69 |
PCA | KNN | All 18 | TIMIT white noise | 120 | 100 | 5.8 | 1.2 | 93.3 | 0.58 |
PCA | KNN | All 18 | TIMIT white noise | 630 | 126 | 3.08 | 2.9 | 81.4 | 0.13 |
PCA | KNN | All 18 | VoxCeleb1 | 1251 | 126 | 1646 | 72.6 | 94.7 | 2.2 |
Method | Classifier | Feature Used | Database | Number of Speakers | Number of Feature Vectors | Training Time (s) | Testing Time (s) | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|---|---|---|
PCA-GA | KNN | All 18 | TIMIT babble noise | 120 | 81 | 1.9 | 0.9 | 85.6 | 0.7 |
PCA-GA | KNN | All 18 | TIMIT babble noise | 630 | 90 | 2.4 | 0.8 | 93.5 | 0.13 |
PCA-MPA | KNN | All 18 | TIMIT white noise | 120 | 103 | 2.7 | 1.2 | 87.9 | 0.8 |
PCA-MPA | KNN | All 18 | TIMIT white noise | 630 | 112 | 1.7 | 1.8 | 83.5 | 0.13 |
PCA-MPA | KNN | All 18 | VoxCeleb1 | 1251 | 112 | 1374.3 | 42.5 | 95.2 | 1.8 |
Method | Features Used (Model) | Classifier Model | Speech Database | Number of Speakers | Number of Feature Vectors | Optimization Method | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|---|---|
Feature-level fusion (approach 1) (proposed) | LPC + PLP + ΔMFCC + ΔPLP + MFCC + ΔΔentropy + Δentropy + ΔΔRMS + entropy + RMS + ΔLPC + ΔΔPLP | LD | TIMIT babble noise, 30 dB | 120 | 96 | None | 92.7 | 1.3 |
Feature selection (approach 3) (proposed) | All 18 | KNN | TIMIT babble noise, 30 dB | 120 | 81 | PCA-GA | 85.6 | 0.7 |
Feature selection (approach 3) (proposed) | All 18 | KNN | TIMIT babble noise, 30 dB | 630 | 90 | PCA-GA | 93.5 | 0.13 |
Spectral subtraction [45] | IMFCC | GMM | TIMIT babble noise, 10 dB | 368 | 36 | None | - | 4.3 |
New feature extraction [46] | Multitaper gammatone cepstral coefficient (MGCC), Thomson | I-vector | TIMIT babble noise, 20 dB | 630 | 13 | LDA | - | 6.39 |
Method | Features Used (Model) | Classifier Model | Speech Database | Number of Speakers | Number of Feature Vectors | Dimension Reduction Technique | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|---|---|
Dimension reduction (approach 2) (proposed) | All 18 features | KNN | TIMIT white noise, 30 dB | 120 | 100 | PCA | 93.3 | 0.58 |
Feature selection (approach 3) (proposed) | All 18 features | KNN | TIMIT white noise, 30 dB | 630 | 112 | PCA-MPA | 83.5 | 0.13 |
Score-level fusion [42] | MFCC, PNCC | GMM-UBM, LLR classifier | TIMIT AWGN and G.712 noise, 30 dB | 120 | 16 | None | 75.83 | - |
Score-level fusion [43] | MFCC, PNCC | GMM-UBM | TIMIT AWGN, 30 dB | 120 | 16 | None | 79.17 | - |
ICA feature extraction [44] | ICA | GMM | TIMIT white noise, 20 dB | 100 | 36 | ICA | 63 | - |
Spectral subtraction [45] | IMFCC | GMM | TIMIT white noise, 10 dB | 368 | 36 | None | - | 7.1 |
New feature extraction [46] | MGCC, Thomson | I-vector | TIMIT white noise, 20 dB | 630 | 13 | LDA | - | 8 |
Method | Features Used (Model) | Classifier, Model | Number of Feature Vectors | Number of Speakers | Dimension Reduction Technique | SI Accuracy (%) | SV EER (%) |
---|---|---|---|---|---|---|---|
Feature selection (approach 3) (Proposed) | All 18 | KNN | 112 | 1251 | PCA-MPA | 95.2 | 1.8 |
Score-level fusion [47] | MFCC, DNN | x-vector, attentive statistics pooling | 60 | 1246 | - | - | 3.85 |
Score-level fusion [47] | MFCC, DNN | i-vector | 60 | 1246 | - | - | 5.3 |
Automated pipelined [48] | Short time magnitude spectrogram | CNN + embedding | 13 | 1251 | - | - | 7.8 |
DNN [49] | DNN | x-vector | - | 1251 | - | - | 3.1 |
Temporal average pooling [50] | MFCC | A-Softmax | 60 | 1251 | - | - | 4.46 |
Temporal average pooling [50] | MFCC | CNN-LDE | 60 | 1251 | - | 89.9 | - |