Proceeding Paper

Voice Profile Authentication Using Machine Learning †

Department of Communications Equipment and Technologies, Technical University of Gabrovo, 5300 Gabrovo, Bulgaria
* Author to whom correspondence should be addressed.
Presented at the International Conference on Electronics, Engineering Physics and Earth Science (EEPES’24), Kavala, Greece, 19–21 June 2024.
Eng. Proc. 2024, 70(1), 37; https://doi.org/10.3390/engproc2024070037
Published: 8 August 2024

Abstract

In this paper, personalized results are presented from a methodology for monitoring information security based on voice authentication. Sound preprocessing is integrated with Machine Learning techniques for feature extraction and for the training and validation of classification models. The objects of research are stacked mixed-test voice profiles. The Naive Bayes and Discriminant Analysis classifiers yielded quantitative evaluations below a threshold of 90.00%. Significantly improved accuracy, at approximate levels of 96.0%, was established with Decision Tree synthesis. Strongly satisfactory performance indices were reached in the diagnosis of voice profiles using Feed-Forward and Probabilistic Neural Networks, at 98.00% and 100.00%, respectively.

1. Introduction

Voice Recognition is a complex process that passes through a sequence of components: (1) Speech Preprocessing; (2) Feature Extraction; (3) Speech Classification; and (4) Recognition [1,2]. Feature extraction modules in voice diagnostics rely on the analysis of spectrograms and mel-spectrograms, from which Spectral Features are acquired. These are sets of Static or Dynamic Mel-Frequency Cepstral Coefficients (MFCCs) [3,4,5]. The incoming voice spectrograms are obtained via the Short-Time Fourier Transform (STFT) or the Discrete Cosine Transform (DCT). The generated feature datasets serve as input parametric units for analytics of different dimensions when generating classification models [6,7,8]. Recent scientific contributions are based on voice classifiers built with Machine Learning (ML) and Deep Learning (DL), such as the Support Vector Machine (SVM), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and the Bidirectional LSTM Network (Bi-LSTM) [9,10,11,12]. Algorithmic and neural efficiency is analyzed with heterogeneous datasets under the following conceptual tasks: (1) Gender Classification (GC); (2) Speaker Verification (SV); (3) Language Identification (LID); (4) Regional Dialect Identification (DID); and (5) Channel Classification (CC) [13,14,15].
In relation to the purpose of the present study, statistical sound level characteristics were extracted experimentally from the voices of seven male and female individuals pronouncing selected speech commands. The procedures included two types of sound measurements: No. 1, all acoustic and audio measurements; and No. 2, measurements of sound levels below 100 dB performed in fixed temporal intervals of 15 s for every registered vocal profile. These categories of sound parameters were registered by a portable device running a dedicated sound analyzer application. During the preliminary selection of informative features among the formed input datasets with different combinations of sound parameters, tests were performed with artificial neural networks to check the correctness of voice recognition. The best indicators were obtained for category No. 2, comprising LAE (A-weighted sound exposure level), [dBA]; LAeq (A-weighted equivalent sound level), [dBA]; LAF (A-weighted sound level, fast time constant), [dBA]; LAS (A-weighted sound level, slow time constant), [dBA]; and LAI (A-weighted sound level, impulse time constant), [dBA].
The mentioned spectrum of sound features was selected after benchmarking performance and recognition accuracy with initial variants of neural networks and was subsequently used as the basic input unit for training and verification of models for voice profile identification. The paper synthesizes a methodology for voice profile identification for security and personalization of information access, based on probabilistic analysis tools implemented entirely in MATLAB R2014a. The following analytics tools were introduced: Discriminant Analysis; Feed-Forward Neural Networks; Probabilistic Neural Networks; CART Decision Trees; and Naive Bayes classification. Through performance evaluation of the analytical modules, final classification models with a high recognition index on the voice samples in the datasets were selected.
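To make the data organization concrete, the following minimal MATLAB sketch illustrates one possible layout of the input dataset; the variable names (X, y, featureNames), the per-person sample count, and the synthetic values are illustrative assumptions, not the authors' measured data.

    % Illustrative layout of the voice-profile dataset (assumed): each row
    % corresponds to one 15 s measurement window, each column to one of the
    % five A-weighted sound level descriptors, and y holds the person label.
    featureNames = {'LAE', 'LAeq', 'LAF', 'LAS', 'LAI'};    % all in dBA
    nPerPerson   = 100;                                      % placeholder count
    X = [];  y = [];
    for person = 1:7
        % synthetic placeholder levels; replace with the measured values
        X = [X; 60 + 20*rand(nPerPerson, numel(featureNames))];
        y = [y; person*ones(nPerPerson, 1)];
    end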

2. Discriminant Analysis in Voice Profile Recognition

In the initial stage of the studies, the possibility of adapting Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers and their subvariants was considered, as follows:
  • Diagonal Linear Discriminant Analysis (DLDA);
  • Pseudo Linear Discriminant Analysis (PLDA);
  • Diagonal Quadratic Discriminant Analysis (DQDA);
  • Pseudo-Quadratic Discriminant Analysis (PQDA).
A set of criteria was adopted to examine the quality of the created classification models, as follows: (1) Loss; (2) Accuracy; and (3) Misclassifications, determined under two basic approaches: Resubstitution and k-fold cross-validation. The results of the indicated activities are shown in Table 1. The obtained quantitative estimates show a significant advantage of the Quadratic over the Linear models. Under Resubstitution, low accuracy thresholds of 55.36% with DLDA and 66.14% with LDA and PLDA were reported. The indicators in the cross-validation procedures turned out to be similar, where 66.00%, 55.36%, and 66.36% were found for LDA, DLDA, and PLDA. The results achieved with the Quadratic classifiers were noticeably better. Resubstitution Loss values of 0.1321, equivalent to 86.79%, were observed for QDA and PQDA, against 0.2907 (70.93%) for DQDA. Regarding the cross-validation Loss, accuracies of 85.86%, 70.71%, and 86.36% were obtained for QDA, DQDA, and PQDA, respectively.
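A minimal MATLAB sketch of this evaluation step is given below; fitcdiscr, resubLoss, crossval, and kfoldLoss are Statistics Toolbox functions, while the placeholder data and the 10-fold setting (the exact k used by the authors is not stated) are assumptions.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));

    types = {'linear', 'diagLinear', 'pseudoLinear', ...
             'quadratic', 'diagQuadratic', 'pseudoQuadratic'};
    for k = 1:numel(types)
        mdl     = fitcdiscr(X, y, 'DiscrimType', types{k});
        resLoss = resubLoss(mdl);                         % Resubstitution Loss
        cvLoss  = kfoldLoss(crossval(mdl, 'KFold', 10));  % cross-validation Loss
        fprintf('%-16s  resub acc. %.2f%%   cv acc. %.2f%%\n', ...
                types{k}, 100*(1 - resLoss), 100*(1 - cvLoss));
    end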
According to the analysis of the results, a classification model based on Pseudo-Quadratic Discriminant Analysis was selected, with an approximately expected accuracy of 86.575% when operating with voice samples not involved in the training and test procedures. A specification of the classifier synthesis variables is given in Figure 1. Figure 2 visualizes the results when determining the belonging of the voice samples to the defined classification groups for the models with the highest and lowest established quality indicators, respectively, when using DLDA and PQDA. Correctly classified data were arranged diagonally from upper left to lower right, and Misclassifications were assigned to the remaining groups by matrix rows. The largest amount of incorrectly classified data was found in the sample for voice analysis object “Person No 4”. It should be noted that 1, 20, 61, and 38 samples were assigned to the information compositions of “Person No 3”, “Person No 5”, “Person No 6”, and “Person No 7”, respectively.
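A short sketch of how a confusion matrix such as those in Figure 2 can be obtained for the selected Pseudo-Quadratic model is shown below; predict and confusionmat are standard Statistics Toolbox calls, and the data are again placeholders.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));

    pqda = fitcdiscr(X, y, 'DiscrimType', 'pseudoQuadratic');
    yHat = predict(pqda, X);           % resubstitution predictions
    C    = confusionmat(y, yHat);      % rows: true person, columns: assigned person
    disp(C)                            % diagonal entries = correctly classified samples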

3. Feed-Forward and Probabilistic Neural Networks in Voice Identification

In the course of research related to the Artificial Intelligence concept, two categories of neural networks were introduced: Feed-Forward Neural Networks (FFNNs) and Probabilistic Neural Networks (PNNs). The neural validation performance was scored on the basis of Mean-Squared Error (MSE) and classification accuracy indicators. The variation in these indicators was studied under a stepwise increase in the number of hidden-layer neurons with hyperbolic tangent sigmoid activation for the FFNNs, and under variation of the spread indicator of the Radial Basis Function (RBF) layer with kernel transfer function for the PNNs.
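The following MATLAB sketch shows how the two network families could be instantiated with Neural Network Toolbox calls; the 35-neuron hidden layer and the 0.500 spread correspond to values discussed below, while the placeholder data, the target encoding, and the resubstitution-style accuracy check are illustrative assumptions.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));
    targets = full(ind2vec(y'));             % 7-by-N one-hot target matrix

    % Feed-forward network with hyperbolic tangent sigmoid hidden activation
    ffnn = feedforwardnet(35);
    ffnn.layers{1}.transferFcn = 'tansig';
    ffnn = train(ffnn, X', targets);
    ffAcc = 100 * mean(vec2ind(ffnn(X')) == y');   % resubstitution-style check

    % Probabilistic network with a radial basis layer and fixed spread
    pnn   = newpnn(X', targets, 0.500);
    pnAcc = 100 * mean(vec2ind(sim(pnn, X')) == y');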
The results of the applied series of experiments are presented in Table 2. Regarding the FFNNs, the MSE variations were limited to the interval from 0.0113 to 0.0527, registered for networks with 35 and 5 hidden neurons, respectively. The corresponding accuracies were a maximum of 97.90% (Figure 3a, the selected FFNN model) and a minimum threshold of 73.20%.
For the Probabilistic Neural Network structures created with a fixed number of RBF neurons (Figure 3b), variations in the spread from 0.500 to 0.575 did not cause a change in the error, where a constant level of MSE = 4.0816 × 10⁻⁴ was observed. As the spread increases from 0.600 onward, the error grows smoothly at first and then more rapidly, approximately exponentially, until the highest value of 0.0051 is reached at the limit spread of 0.925. The analysis shows more adequate behavior of the PNNs compared to the FFNNs, as the lowest accuracy found does not fall below 98.20%. Similar judgments can be made in a quantitative analysis against the MSE indicator, where the maximum error for the PNNs is several times lower than the largest reported MSE value for the feed-forward structures.
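A sketch of the spread sweep described above is given next; the 0.025 step and the 0.500 to 0.925 range follow Table 2, while the placeholder data and the evaluation on the training set are assumptions.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));
    targets = full(ind2vec(y'));

    spreads = 0.500:0.025:0.925;
    mseVals = zeros(size(spreads));
    accVals = zeros(size(spreads));
    for k = 1:numel(spreads)
        pnn        = newpnn(X', targets, spreads(k));
        out        = full(sim(pnn, X'));
        mseVals(k) = mean((targets(:) - out(:)).^2);      % Mean-Squared Error
        accVals(k) = 100 * mean(vec2ind(out) == y');      % classification accuracy
    end
    [~, best] = min(mseVals);
    fprintf('Best spread %.3f (MSE %.2e, accuracy %.2f%%)\n', ...
            spreads(best), mseVals(best), accVals(best));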
The Confusion matrices and Error diagrams in Figure 4 and Figure 5 confirm the advantages of the synthesized FFNN with 35 hidden neurons and the PNN with a spread between 0.500 and 0.575. Compared with the Linear and Quadratic Discriminant classifiers, considerably fewer Misclassifications were assigned by matrix rows to incorrect output groups. A variation range of −0.7561 to 0.6393 was determined for the network errors of the final FFNN model. The larger error fluctuations, observed for the voice samples of “Person No 3” to “Person No 7”, fall within a relatively close range, and a similar observation holds for the lower variations for “Person No 1” and “Person No 2”, which were also subject to voice profile identification. The variations found for the selected PNN classification model are significantly lower, with sharply limited increases in the fourth and seventh output groups, which can be ignored.

4. Decision Tree Modeling for Voice Profile Recognition

In the next stage of the research, activities were carried out on the modeling of structures for multivariate selection of classification decisions using the CART algorithm. In accordance with the specifics of the model synthesis procedure for voice authentication, 24 classification models were generated, corresponding to a basic structure (Pruning Level “0”) and structures with sequential removal of nodal branches (Pruning Levels “1” to “23”) (Table 3). The tests applied to the initially generated model with 49 nodes using the Resubstitution and cross-validation techniques show high levels of accuracy of 98.50000% and 93.7857%, respectively. The reduction in the number of building nodes at Pruning Levels “21” and “22” was tied to a significant decrease in the efficiency of the Decision Tree (DT) models, for which accuracies of around 65.00% and 57.00% were observed. At the final Pruning Level “23”, with only one structural node remaining, the classification accuracy dropped to only 14.28571% in Resubstitution and 14.2857% in cross-validation.
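The pruning experiment can be sketched in MATLAB as follows; fitctree implements CART-style binary splitting, and prune, resubLoss, crossval, and kfoldLoss are Statistics Toolbox calls. The placeholder data and the 10-fold setting are assumptions.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));

    tree     = fitctree(X, y);                 % full tree, Pruning Level "0"
    maxLevel = max(tree.PruneList);            % deepest available pruning level
    for lvl = 0:maxLevel
        t      = prune(tree, 'Level', lvl);
        resAcc = 100 * (1 - resubLoss(t));
        cvAcc  = 100 * (1 - kfoldLoss(crossval(t, 'KFold', 10)));
        fprintf('Level %2d: resub. %8.5f%%   cv %7.4f%%\n', lvl, resAcc, cvAcc);
    end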
Following an analysis of the performance of the models, the DT model at the best Pruning Level “2” was selected, with a composition of 41 nodes, responsible for maintaining an optimal solution structure and acceptable classification accuracy. Based on the chosen classification architecture, an approximately expected accuracy of 95.85425% was calculated for operation with new voice data. The specification of the variables in the overall processes of training, verification, and evaluation of the effectiveness of the models using the Decision Tree method is shown in Figure 6. The distribution of correctly and incorrectly classified voice samples for the found optimum (Pruning Level “2”) and for the structure with the worst performance (Pruning Level “23”) is presented in Figure 7. The unsuitability of the last-mentioned structure for multivariate decision selection is clearly confirmed by successful authentication being found only for the first person subject to voice analysis.

5. Naïve Bayes Algorithm in Voice Identification

The last module of the methodology for selecting models for personalizing user access using voice analysis instruments and analytics provides for the implementation of Naïve Bayes (NB) classification. In the NB approach, two variants of the probabilistic description of the functional input data were set, using Gaussian and Kernel distributions. For the created NB models with the specified distributions, Resubstitution and cross-validation procedures similar to those of the other approaches were performed, as shown in Figure 8. In the case of the normal (Gaussian) distribution of the input information set, accuracies of 70.93% and 70.64% were obtained. Through the introduction of Kernel distribution instruments, an increase to 76.36%, 74.16%, and 75.26% was achieved for Resubstitution, cross-validation, and the approximately expected accuracy on new data, respectively. The specification reflects the input variables during training, the assigned labels of predictors and classification groups, the created NB models, specific evaluations in the separate phases of the tests for functional belonging, etc.
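A minimal sketch of the two Naïve Bayes variants is given below. fitcnb with Gaussian ('normal') and kernel-smoothed predictor densities was introduced in MATLAB releases after R2014a, so it is used here only as an assumed equivalent of the authors' implementation; the data and the 10-fold setting are placeholders.

    % Placeholder data only; replace with the measured levels and labels.
    X = 60 + 20*rand(700, 5);  y = kron((1:7)', ones(100, 1));

    nbGauss  = fitcnb(X, y, 'DistributionNames', 'normal');   % Gaussian densities
    nbKernel = fitcnb(X, y, 'DistributionNames', 'kernel');   % kernel-smoothed densities

    for mdl = {nbGauss, nbKernel}
        m      = mdl{1};
        resAcc = 100 * (1 - resubLoss(m));
        cvAcc  = 100 * (1 - kfoldLoss(crossval(m, 'KFold', 10)));
        fprintf('resub. %.2f%%   cv %.2f%%\n', resAcc, cvAcc);
    end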
The Confusion matrices in Figure 9 show the highest recognition rates of voice samples among the persons included in the target group for voice identification:
  • for the first, second, third, and seventh persons with the Gaussian technique;
  • for the second, first, third, and sixth persons with the Kernel distribution.

6. Conclusions

The empirically established accuracy thresholds for personal authentication based on voice analysis with the proposed methodology show very good applicability of Discriminant Analysis, Decision Trees, and Feed-Forward and Probabilistic Neural Networks. In this particular aspect, the developed methodology for the synthesis of models for voice identification can be implemented in security management systems and for user access authorization. To address the need to raise the classification accuracy of the probabilistic models created on the basis of the Naïve Bayes algorithm above the threshold of 80.00%, preprocessing procedures in the Frequency Domain of the voice profiles are planned. Similar activities would also be of interest for the Linear and Quadratic classifiers. Another important point is the search for new potential Machine Learning methods and algorithms with high recognition success. Examples include Support Vector Machines, Adaptive Neuro-Fuzzy Inference Systems, and k-Nearest Neighbors, among others.

Author Contributions

Conceptualization, I.B. and G.G.; methodology, I.B., K.S. and G.G.; software, I.B., K.S. and G.G.; validation, I.B., K.S. and G.G.; formal analysis, I.B.; investigation, I.B., K.S. and G.G.; resources, K.S.; data curation, I.B., K.S. and G.G.; writing—original draft preparation, I.B., K.S. and G.G.; writing—review and editing, I.B., K.S. and G.G.; visualization, I.B., K.S. and G.G.; supervision, I.B.; project administration, G.G.; funding acquisition, Internal project for Technical University of Gabrovo, Bulgaria. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Detailed information about the presented article can be freely obtained by contacting the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dudhrejia, H.J.; Shah, S.A. Speech recognition using neural networks. Int. J. Eng. Res. Technol. 2018, 7, 196–202. [Google Scholar]
  2. Kamble, B.C. Speech recognition using artificial neural network—A review. Int. J. Comput. Commun. Instrum. Eng. 2016, 3, 61–64. [Google Scholar]
  3. Javanmardi, F.L.; Kadari, S.R.; Alku, P.K. A comparison of data augmentation methods in voice technology. Comput. Speech Lang. 2023, 83, 101552. [Google Scholar] [CrossRef]
  4. Alsobhani, A.; Albboodi, H.M.; Mahdi, H.L. Speech recognition using convolutional deep neural networks. J. Phys. Conf. Ser. 2021, 1973, 012166. [Google Scholar] [CrossRef]
  5. Rady, E.R.; Hassen, A.; Nassan, N.M.; Hesham, M.U. Convolutional neural network for Arabic speech recognition. Egypt. J. Lang. Eng. 2021, 8, 27–38. [Google Scholar]
  6. Kadiri, S.R.; Javanmadri, F.J.; Alku, P.I. Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals. Comput. Speech Lang. 2023, 83, 101550. [Google Scholar] [CrossRef]
  7. Isvamko, D.R.; Ryuman, D.P. Development of visual and audio speech recognition systems using deep neural networks. In Proceedings of the International Conference on Computer and Vision, Nizhny Novgorod, Russia, 27–30 September 2021. [Google Scholar]
  8. Graves, A.N.; Mohamed, A.R.; Hinton, G.U. Speech recognition with recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
  9. Song, W.S.; Cai, J. End-to-end deep neural network for automatic speech recognition. J. Comput. Sci. 2015, 1, 1–8. [Google Scholar]
  10. Sheikh, I.P.; Vincent, E.P.; Illina, I.F. Training RNN language models on uncertain ASR hypothesis in limited data scenarios. Comput. Speech Lang. 2023, 83, 101555. [Google Scholar] [CrossRef]
  11. Sridhar, C.D.; Kanhe, A.R. Performance comparison of various neural networks for speech recognition. In Proceedings of the International Conference on Communications Systems, Karaikal, India, 4–8 January 2022. [Google Scholar]
  12. Okay, M.O.; Akin, E.; Asian, O.; Kosunaip, S.; Iliev, T.B.; Stoyanov, I.S.; Beloev, I. A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions. IEEE Access 2024, 12, 12229–12256. [Google Scholar] [CrossRef]
  13. Shaughnessy, D.K. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2023, 83, 101538. [Google Scholar] [CrossRef]
  14. Chowdhury, S.A.; Durrani, N.M.; Ali, A.G. What do end-to-end speech models learn about speaker, language and channel information? A layer-wise and neuron-level analysis. Comput. Speech Lang. 2023, 83, 101539. [Google Scholar] [CrossRef]
  15. Rudregowda, S.S.; Patilkulkurni, S.H.; Ravi, V.Y.; Gururaj, H.L. Audiovisual speech recognition based on a deep convolutional neural network. Data Sci. Manag. 2024, 7, 25–34. [Google Scholar] [CrossRef]
Figure 1. Variables in selection of Pseudo-Quadratic Discriminant classifier.
Figure 2. Confusion matrices at Diagonal Linear (a) and Pseudo-Quadratic (b) classifiers.
Figure 3. Synthesized Feed-Forward (a) and Probabilistic (b) models for voice profile identification.
Figure 4. Matrices of correct (green color) and incorrect (red color) classifications for selected Feed-Forward (a) and Probabilistic (b) neural models for voice profile personalization.
Figure 5. Error diagrams at application procedures of selected FFNN (a) and PNN (b) models.
Figure 6. Variables in synthesis procedures of Decision Tree structures for voice authentication.
Figure 7. Confusion matrices for Optimal (a) and Worst case (b) Decision Tree classification models.
Figure 8. Examination of the quality of Naïve Bayes voice profile classification models at Gaussian (a) and Kernel (b) input data distribution.
Figure 9. Confusion matrices at voice profile identification models for NB classifiers with Gaussian (a) and Kernel (b) input data distribution.
Table 1. Discriminant classifiers in voice profile operating procedures.

Type Classifier     | Resubstitution                    | Cross-Validation
                    | Loss     Accuracy, %   Misc.      | Loss     Accuracy, %   Misc.
Linear              | 0.3386   66.14         474        | 0.3400   66.00         476
DiagLinear          | 0.4464   55.36         625        | 0.4464   55.36         625
PseudoLinear        | 0.3386   66.14         474        | 0.3364   66.36         471
Quadratic           | 0.1321   86.79         185        | 0.1414   85.86         198
DiagQuadratic       | 0.2907   70.93         407        | 0.2929   70.71         410
PseudoQuadratic     | 0.1321   86.79         185        | 0.1364   86.36         191
Table 2. FFNNs and PNNs at voice profile classification.

Feed-Forward Neural Networks                  | Probabilistic Neural Networks
Hidden Neurons   MSE       Accuracy, %        | Spread Indicator   MSE             Accuracy, %
5                0.0527    73.20              | 0.500              4.0816 × 10⁻⁴   99.90
10               0.0412    80.00              | 0.525              -               -
15               0.0149    95.70              | 0.550              -               -
20               0.0143    -                  | 0.575              -               -
25               -         96.40              | 0.600              8.1633 × 10⁻⁴   99.70
30               0.0149    95.70              | 0.625              -               -
35               0.0113    97.90              | 0.650              0.0012          99.60
40               0.0141    -                  | 0.675              0.0014          99.50
45               0.0133    -                  | 0.700              0.0018          99.40
50               0.0146    96.80              | 0.725              -               -
55               0.0164    97.90              | 0.750              0.0027          99.10
60               0.0154    97.10              | 0.775              0.0029          99.00
65               0.0155    96.10              | 0.800              -               -
70               0.0137    97.90              | 0.825              -               -
75               0.0173    95.40              | 0.850              0.0033          98.90
80               0.0153    -                  | 0.875              0.0043          98.50
85               0.0145    97.50              | 0.900              0.0047          98.40
90               0.0140    97.10              | 0.925              0.0051          98.20
Table 3. Decision Trees based on the CART algorithm for voice profile classification.

Pruning Level   Nodes   | Resubstitution              | Cross-Validation
                        | Loss        Accuracy, %     | Loss        Accuracy, %
0               49      | 0.015000    98.50000        | 0.062143    93.7857
1               43      | 0.019286    98.07143        | 0.060000    94.0000
2               41      | 0.021429    97.85714        | 0.061429    93.8571
3               37      | 0.027143    97.28571        | 0.069286    93.0714
4               35      | 0.030714    96.92857        | 0.069286    93.0714
5               33      | 0.035000    96.50000        | 0.071429    92.8571
6               29      | 0.045000    95.50000        | 0.075714    92.4286
7               27      | 0.050714    94.92857        | 0.084286    91.5714
8               25      | 0.057857    94.21429        | 0.090714    90.9286
9               22      | 0.070714    92.92857        | 0.097143    90.2857
10              21      | 0.075714    92.42857        | 0.104286    89.5714
11              20      | 0.081429    91.85714        | 0.108571    89.1429
12              19      | 0.087857    91.21429        | 0.128571    87.1429
13              17      | 0.102143    89.78571        | 0.147143    85.2857
14              13      | 0.132857    86.71429        | 0.155714    84.4286
15              12      | 0.141429    85.85714        | 0.160714    83.9286
16              11      | 0.150714    84.92857        | 0.162143    83.7857
17              10      | 0.167143    83.28571        | 0.199286    80.0714
18              8       | 0.201429    79.85714        | 0.207857    79.2143
19              7       | 0.230714    76.92857        | 0.234286    76.5714
20              6       | 0.276429    72.35714        | 0.280000    72.0000
21              5       | 0.352143    64.78571        | 0.343571    65.6429
22              4       | 0.428571    57.14286        | 0.429286    57.0714
23              1       | 0.857143    14.28571        | 0.857143    14.2857
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
