Article

Analysis and Investigation of Speaker Identification Problems Using Deep Learning Networks and the YOHO English Speech Dataset

by Nourah M. Almarshady *, Adal A. Alashban and Yousef A. Alotaibi
Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9567; https://doi.org/10.3390/app13179567
Submission received: 8 August 2023 / Revised: 21 August 2023 / Accepted: 23 August 2023 / Published: 24 August 2023
(This article belongs to the Special Issue Automatic Speech Signal Processing)

Abstract

The rapid momentum of deep neural networks (DNNs) in recent years has yielded state-of-the-art performance in various machine-learning tasks, including speaker identification. Speaker identification is based on the speech signal and the features that can be extracted from it. In this article, we propose a speaker identification system using developed DNN models. The system is based on the acoustic and prosodic features of the speech signal, such as pitch frequency (the vocal cord vibration rate), energy (the loudness of speech), their derivatives, and additional acoustic and prosodic features. Additionally, the article investigates existing recurrent neural network (RNN) models and adapts them to design a speaker identification system using the public YOHO LDC dataset. The average accuracy of the system was 91.93% in the best speaker identification experiment. Furthermore, this paper analyzes the speakers and test tokens that yield the most errors, in order to increase the system’s robustness with respect to feature selection and system tune-up.

1. Introduction

Speaker identification is the process of determining who a person is by comparing their voice to a list of registered, known users. It differs from authentication (also known as speaker verification). Speaker recognition, in terms of identifying and verifying the identity of humans, depends on many features that the human voice signal can provide, such as acoustic information (short-time features) and prosody (speech pace, pitch period, stress, etc.) [1]. The acoustic features of the speech signal are commonly used when developing a speaker recognition system. These features should be effective, redundancy-free, and compact relative to the original speech signal waveform. Much speaker recognition research has used DNNs as a classification model. Various models are used in speaker recognition: vector quantization (VQ) [2], the hidden Markov model (HMM) [3], and deep neural networks (DNNs) [4]. The VQ technique operates in text-independent speaker recognition systems: it maps a large vector space into a small, finite space. It is a dimension-reducing approach in which the reduced space is divided into clusters, and the set of cluster centers is called a codebook. Classification is performed by choosing the codebook vector closest to the input [2]. For text-dependent systems, VQ shows unreliable and poor results [3]. Recently, the success of DNNs for speaker identification has spurred researchers to adopt this approach. DNNs can be used both to generate features and to classify them, or, in some research, feature extraction is carried out before the DNN is applied.
Our contributions in this article are summarized as follows: (1) we propose a system based on developed DNN models that identifies speakers using pitch as the main feature, in addition to acoustic features such as energy; (2) we evaluate the proposed system using the YOHO speech dataset developed by the Linguistic Data Consortium (LDC) [5]; (3) we enhance the audio signal by pre-processing the speech files, segmenting the utterances into frames, and computing the fundamental frequency F0; (4) we compute delta pitch (∆F0) and delta energy (∆E), which can help characterize speaker identity by capturing the change in each speaker’s vocal cord vibration; (5) we state our outcomes, observations, and results for the system and the model used, and compare them with previous studies.
The rest of the manuscript is structured as follows: in Section 2, we provide a literature review of the speaker recognition process, including identification and verification tasks, along with relevant citations to prior works; in Section 3, the chosen dataset is presented; in Section 4, we show the methodology of the proposed system in detail; in Section 5, we evaluate the proposed system using different models; in Section 6, we present and discuss the results; finally, in Section 7, we provide guidelines for future research.

2. Literature Review

Many speaker recognition methods have been proposed by researchers, such as text-dependent or text-independent systems and speaker identification or speaker verification systems that use pitch or other features as system inputs. The authors in [6] developed a voice recognition methodology that identifies a person from the acoustic signal of the vocal cords’ vibration. The methodology works through pre-processing and feature extraction steps based on a new measurement technique for the vocal cords’ vibration frequency, unlike existing strategies that deal with the speaker’s voice directly. The methodology provides 91.00% accuracy in identifying the desired speaker.
Nasr et al. [7] proposed a speaker identification system in which a normalized pitch frequency (NPF) is incorporated with mel frequency cepstral coefficients (MFCCs) as the feature input to improve the system’s performance. The MFCCs were extracted as a system feature, and a normalization step for the pitch detection approach was then implemented. After feature extraction, the proposed speaker identification system was evaluated with an artificial neural network (ANN) using a multi-layer perceptron (MLP) model, classifying the speaker based on the newly proposed NPF feature incorporated with MFCCs. The experiment includes seven feature extraction cases. Simulation results show that the proposed system improves the recognition rate.
An et al. [8] used two prototypical deep convolutional neural networks (CNNs), the visual geometry group network (VGG-CNN) and residual networks (ResNet), and selected the VoxCeleb dataset to evaluate the proposed system. The authors detailed the network training parameters, varying the learning rate to warm up the learning process, and fixed the length of the input sequence at 3 s. They compared state-of-the-art methods to evaluate the proposed approach’s effectiveness, such as traditional i-vector-based methods, ResNet with and without the proposed self-attention layer, and a VGG-like CNN with and without the proposed self-attention layer. The results show that the two proposed methods reached 88.20% and 90.80% top-1 speaker identification accuracy, respectively.
Moreover, Meftah et al. [9] proposed a speaker identification system for emotional states using a CNN and long short-term memory (LSTM) to design a convolutional recurrent neural network (CRNN) system. They selected the King Saud University Emotion (KSUEmotion) corpus for Modern Standard Arabic (MSA) and the Emotional Prosody Speech and Transcripts (EPST) corpus for English. The experiment investigated the impact of language variation and the speaker’s emotional state on system performance. For the pre-processing step, the unvoiced and silent parts of the speech signal were ignored. The proposed system recognizes speakers regardless of their emotional state. The experiment shows good performance, with 97.46% accuracy for Arabic, 97.18% for English, and 91.83% for cross-language, which is highly promising considering that only one input feature, the spectrogram, was used. Jahangir et al. [10] evaluated a novel fusion of MFCCs and time-based features (MFCCT), combined to improve the accuracy of the speaker identification system. The system was developed with a feedforward neural network (FFNN); a customized FFNN was used as a classifier to identify the speaker. To train the FFNN, the fused features were fed to identify speakers based on each utterance’s unique pattern. To avoid overfitting, different training functions were used. The experimental results for the proposed MFCCT showed overall accuracy between 83.50% and 92.90%.
Jakubec et al. [11] recently evaluated a text-independent speaker recognition system using deep CNNs. They experimented with two comprehensive architectures, ResNet and VGG, and selected the VoxCeleb1 dataset for the proposed system. Two types of speaker recognition (SR) approaches were used: speaker identification (SI) and speaker verification (SV). According to the experimental results, the best accuracy was achieved by the ResNet-34-based network in both SI and SV. Singh [12] used a CNN to develop a speaker recognition model. The system experimented with a novel combination of low-level and high-level features using MFCCs, with two classifiers, a support vector machine (SVM) and k-nearest neighbors (KNN), used to enhance the system’s performance. To sum up, Table 1 summarizes these related works and their reported results.
Vandyke et al. [13] developed voice source waveforms for utterance-level speaker identification using SVM: each speech signal of the YOHO dataset is processed in a feature extraction phase that determines the pitch period for each frame, and the source-frame feature is then used as input to the SVM model. They experimented with multi-class SVM and single-class SVM regression, obtaining 85.30% and 72.50% accuracy, respectively. Shah et al. [14] proposed a two-branch network to extract features from face and voice signals in a multimodal system, tested with an SVM. The results indicate an overlap between face and voice; furthermore, they demonstrated that facial data enhance speaker recognition performance.
Hamsa et al. [15] evaluated an end-to-end framework for speaker identification under challenging conditions, including emotion and interference, using learned voice segregation and speech VGG. The presented model outperformed recent literature on emotional speech data in English and Arabic, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson audio-visual dataset (RAVDESS), the speech under simulated and actual stress (SUSAS) dataset, and the Emirati-accented speech dataset (ESD), respectively. Nevertheless, since speech carries a wealth of information, obtaining salient features that can identify speakers remains a difficult problem in speech recognition systems [16,17].
Table 1. Related works comparison with results.

Reference | Year | Dataset | Features | Technique | Accuracy (%)
[13] | 2013 | YOHO | Pitch | SVM | Multi-class: 85.30; single-class: 72.50
[6] | 2017 | - | Spectrogram | Correlation | 91.00
[8] | 2019 | VoxCeleb | MFCCs | VGG, ResNet | 88.20, 90.80
[9] | 2020 | KSU-Emotions, EPST | Spectrogram | CRNN | Arabic: 97.46; English: 97.18
[10] | 2020 | LibriSpeech | MFCCT | FFNN | 92.90
[18] | 2020 | - | MFCCs, Pitch | Correlation | 92.00
[11] | 2021 | VoxCeleb1 | Spectrogram | VGG, ResNet | SI: 93.80; SV: 5.33 EER
[12] | 2023 | - | MFCCs | CNN | 92.46
[14] | 2023 | VoxCeleb1 | Two-branch network | SVM | 97.20
[15] | 2023 | RAVDESS, SUSAS, ESD | Voice segregation, speech VGG | New pipeline | 85.20, 87.00, 86.60
[19] | 2023 | VoxCeleb1 | Pre-feed-forward feature extractor | SANs | 94.38
[20] | 2023 | DEMoS | Spectrogram, MFCCs | CNN | 90.15

3. YOHO LDC Dataset

The dataset considered for this work is a specialized, public corpus: the YOHO speech dataset managed and published by LDC [5]. It is a large-scale, high-quality speech corpus supporting text-dependent speaker authentication research for secure access technology. The data were collected in 1989 by ITT under a US government contract and had not previously been available for public use. The number of trials is thus sufficient to permit evaluation testing at high confidence levels. The corpus is divided into “enrolment” and “verification” segments containing data from all 138 speakers. The enrolment sessions are the training sessions, and the verification sessions were recorded in a different period to test the dataset. Figure 1 details the dataset files. The speakers are numbered from 101 up to 277 (138 speakers). Each speaker has four enrolment sessions (as shown in the enroll part) with 24 phrases per session, a total of 96 files per speaker for training, and ten test sessions (as shown in the verify part) with four phrases per session, a total of 40 files per speaker for testing.
In each session, a speaker was prompted with a series of phrases to be read aloud. Each phrase was a sequence of three two-digit numbers (e.g., 35–72–41, pronounced thirty-five, seventy-two, forty-one). Each speaker had four enrolment sessions of 24 utterances each and ten verification sessions of four utterances each, i.e., 136 utterances in 14 sessions per speaker, as illustrated in Table 2. The sample rate of the speech files was 8 kHz, and the total data size was 1.5 gigabytes. Related speaker verification and identification work was evaluated on the selected dataset [21].

4. Methodology

This section describes the proposed speaker identification system.

4.1. Pre-Processing of Speech Signals

Before feature extraction, a pre-processing step is applied to each speech recording, consisting of normalization, speech detection, noise removal, and framing, to prepare the signal for extracting the desired features.

4.1.1. Signal Normalization

Audio signal normalization refers to changing the overall volume to equalize levels within each speech file and across all speech files considered for the training and testing subsets. It differs from compression and does not affect the sound, the language content, or the identity of the speaker. The normalization changes the amplitude, limiting the signal range to −1 to 1 [22]. Equation (1) expresses the operation on the audio signal: the original signal samples are divided by the maximum of the absolute value of the signal:
$$\text{Normalized signal} = \frac{\text{original signal}}{\max\left(\left|\text{whole original signal}\right|\right)}. \tag{1}$$
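As a minimal MATLAB sketch of Equation (1) (the file name below is a placeholder, not an actual YOHO path):

```matlab
% Peak normalization of one speech file (Equation (1)).
[x, fs] = audioread('utterance.wav');   % placeholder file name; fs is 8 kHz for YOHO
x = x(:, 1);                            % keep a single channel
xNorm = x / max(abs(x));                % limit the signal range to [-1, 1]
```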

4.1.2. Speech Detection

The speech detection and isolation algorithm is applied to the audio signal to remove silence and extract the most beneficial information from the signal. In this stage, we use the MATLAB platform’s speech detection function. The detection function can be described by the block diagram shown in Figure 2, with the following steps (a code sketch follows at the end of this subsection) [23]:
(1) The audio signal is converted to a time–frequency representation using a specified window and overlap length.
(2) For each frame, the short-term energy and spectral spread are calculated.
(3) Histograms are created for both the short-term energy and spectral spread distributions.
(4) A threshold is determined for each histogram.
(5) The short-term energy and the spectral spread are smoothed by passing them through successive five-element moving median filters across time.
(6) A mask is created by comparing the short-term energy and spectral spread with their respective thresholds; a feature must be above its threshold to declare a frame as containing speech.
(7) The masks are combined.
(8) The regions declared as speech are merged.
The result of the speech detection algorithm for an audio signal is illustrated in Figure 3.
In addition, we scan and verify the dataset files of all 138 speakers before and after speech detection, as provided in Table 3.
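For illustration, the steps above correspond to what the Audio Toolbox function detectSpeech performs; the sketch below applies it to the normalized signal from the previous step, with window and overlap values that are assumptions rather than the exact settings used in this work.

```matlab
% Detect speech regions and discard silence (sketch; window/overlap values are assumed).
win = hamming(round(0.03 * fs), 'periodic');           % 30 ms analysis window
idx = detectSpeech(xNorm, fs, ...
                   'Window', win, ...
                   'OverlapLength', round(0.01 * fs)); % [start, end] sample indices per speech region
speechOnly = [];
for k = 1:size(idx, 1)
    speechOnly = [speechOnly; xNorm(idx(k, 1):idx(k, 2))];  % concatenate the detected speech regions
end
```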

4.1.3. Frame Analysis and Windowing

The speech signal is quasi-stationary and slowly varying, which means the signal needs to be segmented for short-time analysis. The segmentation process is called framing. Framing splits the speech signal into frames of fixed length, each around 20–40 milliseconds. The frames must overlap to capture all the signal’s behavior [24]. In our experiments, we set the frame size to 30 milliseconds, with a ten-millisecond overlap. After framing, each frame needs a windowing step to smooth the abrupt cut-off at its edges. Framing is carried out in the time domain, but windowing can be conducted in either the time or the frequency domain. We apply a Hamming window to each frame. Equation (2) shows the Hamming window [22]:
$$w_H[n] = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & n = 0, 1, \ldots, N-1, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
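A possible MATLAB implementation of the framing and windowing step, assuming the Signal Processing Toolbox buffer function is available; it reuses the speechOnly signal from the speech-detection sketch:

```matlab
% Split the speech into 30 ms frames with 10 ms overlap, then apply the Hamming window (Equation (2)).
frameLen = round(0.03 * fs);                                % 240 samples at 8 kHz
overlap  = round(0.01 * fs);                                % 80 samples of overlap between frames
frames = buffer(speechOnly, frameLen, overlap, 'nodelay');  % one frame per column
w = hamming(frameLen);                                      % Hamming window of Equation (2)
windowedFrames = frames .* w;                               % implicit expansion windows every frame
```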

4.2. Features Extraction

Feature extraction is a significant step in developing a speaker identification system. The feature extraction process for the speech signal can be presented in the diagram in Figure 4.

4.2.1. Pitch (F0)

Pitch is the fundamental frequency of the voiced speech signal, produced by the vibration of the vocal cords while vocalizing any voiced phoneme. The periodicity of the signal makes pitch estimation easier; in other words, we can define the pitch of a speech signal as the vocal cords’ vibration rate. The reliability of pitch detection algorithms is one reason pitch is used in speaker recognition systems [25]. Pitch can be estimated with various methods, such as the pitch estimation filter (PEF), cepstrum pitch determination (CEP), and the normalized correlation function (NCF); in the current research, we used NCF to detect pitch. NCF calculates the pitch value in a sequence of steps, as described in Figure 5 and listed below, followed by a code sketch [26]:
(1) Filter the speech signal with a 1 kHz low-pass filter.
(2) Raise the filtered signal to the third power to emphasize its amplitude.
(3) Analyze the pitch period duration by short-time correlation.
(4) Choose the peak that produces a valid pitch, if applicable; this can be computed from the voiced segments of the spoken speech.
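As a sketch, the Audio Toolbox pitch function implements the NCF method described above; the window, overlap, and search range below are illustrative assumptions:

```matlab
% Estimate F0 per frame using the normalized correlation function (NCF) method.
f0 = pitch(speechOnly, fs, ...
           'Method', 'NCF', ...
           'WindowLength', round(0.03 * fs), ...
           'OverlapLength', round(0.01 * fs), ...
           'Range', [50 400]);     % assumed F0 search range in Hz
```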
In this research, we propose novel features to develop a speaker identification system. The main concept is to use the acoustic features of the speech signal along with their first and second derivatives; that is, the pitch frequency (F0), the first pitch derivative ∆F0, and the second pitch derivative ∆∆F0 are considered. The following equations define the derivative calculation.
Equation (3) is the first derivative, where x_n is the pitch of the nth frame and n is the current frame index:
$$\Delta x_n = x_n - x_{n-1}. \tag{3}$$
Equation (4) is the second derivative equation.
$$\Delta\Delta x_n = \Delta x_n - \Delta x_{n-1}. \tag{4}$$
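Equations (3) and (4) can be computed with simple differencing; padding with a leading zero keeps the delta sequences the same length as F0, which is a design choice assumed here rather than something specified in the text:

```matlab
% First and second pitch derivatives (Equations (3) and (4)).
dF0  = [0; diff(f0)];    % delta F0
ddF0 = [0; diff(dF0)];   % delta-delta F0
```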

4.2.2. Log Energy (E)

The energy is the sum of the squared values of the signal, as shown in Equation (5) [22]. In our case, it is computed as the sum of the squared values within each frame.
$$\log E = \log \sum_{n} x_n^2. \tag{5}$$
The energy represents the loudness of the human voice, which by itself may not carry much speaker-specific information, so we also consider its derivatives. The first and second derivatives of the energy can hold speaker information. The calculations for ∆E and ∆∆E were conducted as follows:
Equation (6) is the first derivative, where E_n is the energy of the nth frame and n is the current frame index:
$$\Delta E_n = E_n - E_{n-1}. \tag{6}$$
Equation (7) is the second derivative equation.
$$\Delta\Delta E_n = \Delta E_n - \Delta E_{n-1}. \tag{7}$$
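A sketch of the per-frame log energy of Equation (5) and its derivatives of Equations (6) and (7), computed on the windowed frames from the framing sketch; adding eps before the logarithm is a safeguard of this sketch, not something stated in the text:

```matlab
% Per-frame log energy (Equation (5)) and its first and second derivatives (Equations (6) and (7)).
logE = log(sum(windowedFrames .^ 2, 1) + eps)';  % one log-energy value per frame
dE   = [0; diff(logE)];                          % delta energy
ddE  = [0; diff(dE)];                            % delta-delta energy
```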

4.2.3. Mel Frequency Cepstral Coefficients (MFCCs)

Mel frequency cepstral coefficients (MFCCs) are a representation inspired by the human auditory system. This makes MFCCs a convenient feature for capturing the speech signal [27]. An advantage of MFCCs is the low implementation complexity of the feature extraction algorithm [28]. MFCC extraction produces a selected number of coefficients computed from the speech signal in several steps, as described in Figure 6.
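For illustration, the Audio Toolbox mfcc function follows the same pipeline (framing, mel filter bank, logarithm, discrete cosine transform) and also returns the delta and delta-delta coefficients; default analysis settings are assumed here:

```matlab
% MFCCs with their delta and delta-delta coefficients (default analysis settings assumed).
[coeffs, deltaMfcc, deltaDeltaMfcc] = mfcc(speechOnly, fs);   % one row of coefficients per frame
```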

4.3. Deep Neural Network Model

To achieve the stated goal, we suggested the following methodology to develop the speaker identification system using the DNN model:
(1) Preparation of the training data: the speech signal is segmented into frames, and frames with no pitch estimation value, either unvoiced or silent, are ignored (see the sketch after this list).
(2) Training the classifier: a DNN evaluates the speaker identification system and estimates the feature vectors; the system model is illustrated in Figure 7.
(3) Testing the classification: the verification files of the dataset are used to test the model.
Subsequently, the DNN results are documented and analyzed. The critical point for any speaker identification system is the choice of input features that yield a robust model with high accuracy. The system’s output is a set of classes; the trained system predicts which class (speaker) each input belongs to.
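A sketch of the training-data preparation step described in point (1) above: the per-frame features are stacked into one sequence per utterance, and frames without a pitch value are dropped. It assumes that all feature streams share the same framing and that f0 holds NaN for unvoiced or silent frames; the index and label variables are purely illustrative.

```matlab
% Assemble one feature sequence per utterance and drop frames with no pitch estimate.
utteranceIdx = 1;                    % illustrative utterance index
speakerLabel = {'101'};              % illustrative speaker label
feat = [f0, dF0, ddF0, logE, dE, ddE, coeffs, deltaMfcc, deltaDeltaMfcc]';  % features-by-frames
voiced = ~isnan(f0)';                % frames with a valid pitch estimate
XTrain{utteranceIdx} = feat(:, voiced);           % cell array of sequences for trainNetwork
YTrain(utteranceIdx) = categorical(speakerLabel); % class label = speaker identity
```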

Performance Evaluation

To evaluate the system’s accuracy and robustness, we use the accuracy rate, which refers to the probability of correctly identifying the speaker: the total number of correctly recognized speaker classes divided by the total number of tested speaker classes, as formulated in Equation (8):
$$\text{Accuracy rate} = \frac{\text{correctly predicted classes}}{\text{total testing classes}} \times 100. \tag{8}$$
Additionally, other standard evaluation metrics, such as precision, recall, and F1 score, are described below in detail.
The precision, recall, and F1 score measurements depend on the system’s confusion matrix. The confusion matrix summarizes the prediction results of a classification problem; it shows where the classification model becomes confused when making predictions and provides insight into the errors made by the classifier and their types. Table 4 illustrates the confusion matrix, from which we calculate Equations (9)–(11).
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}, \tag{9}$$
$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}, \tag{10}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{11}$$
(1) Precision: percentage of predicted positive classes that are correct.
(2) Recall: percentage of actual positive classes that are correctly predicted.
(3) F1 measure: harmonic mean of the precision and recall metrics [29] (see the sketch after this list).
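A sketch of computing Equations (8)–(11) from the true and predicted labels, assuming confusionmat (Statistics and Machine Learning Toolbox) is available; for the multi-class case, per-class precision and recall are macro-averaged, which is an implementation choice of this sketch:

```matlab
function [acc, prec, rec, f1] = evaluateClassifier(yTrue, yPred)
% Accuracy, precision, recall, and F1 score (Equations (8)-(11)) from a confusion matrix.
C = confusionmat(yTrue, yPred);               % rows: actual classes, columns: predicted classes
acc = 100 * sum(diag(C)) / sum(C(:));         % Equation (8)
precPerClass = diag(C) ./ max(sum(C, 1)', 1); % TP / (TP + FP), guarded against division by zero
recPerClass  = diag(C) ./ max(sum(C, 2), 1);  % TP / (TP + FN)
prec = mean(precPerClass);                    % macro-averaged precision
rec  = mean(recPerClass);                     % macro-averaged recall
f1   = 2 * prec * rec / (prec + rec);         % Equation (11)
end
```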

5. Experiments

This section presents the three DNN models we developed to evaluate our proposed system: an MLP, initially developed with ten speakers (five males and five females); an LSTM, set up with all 138 speakers (106 males and 32 females); and a BLSTM, developed with the same speakers as the LSTM to compare the two approaches. Coding was carried out using MATLAB R2022b. The simulation equipment was a MacBook Pro with a 2.3 GHz dual-core Intel Core i5 and 8 GB RAM (Jarir store, Riyadh, Saudi Arabia).

5.1. Multi-Layer Perceptron (MLP)

A multi-layer perceptron (MLP), or feedforward neural network (FNN), consists of multiple layers: the input layer, hidden layers, and finally the output layer. The layers are connected by units called “neurons”. The MLP can use various classifiers and training functions depending on the purpose of the model [27]. In our research, we developed a speaker identification system using the MLP model. The model input was a 40-element vector per speech file taken from the dataset, corresponding to 20 frames of pitch F0 and 20 frames of ∆F0, and the model was evaluated with ten speakers, five males and five females. The architecture of the model is shown in Figure 8.
The parameters used by the system are described in Table 5.
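A sketch of an MLP configured with the Table 5 parameters, using the shallow-network patternnet interface; the hidden-layer size is an assumption, since it is not listed in the table, and the training call is left commented because the feature matrix X and one-hot target matrix T must be prepared first:

```matlab
% MLP speaker classifier trained with resilient backpropagation (Table 5 settings).
hiddenSize = 64;                          % assumed hidden-layer size (not given in Table 5)
net = patternnet(hiddenSize, 'trainrp');  % softmax output and cross-entropy performance by default
net.trainParam.lr = 0.0003;               % learning rate from Table 5
% X: 40-by-numFiles matrix (20 F0 values + 20 dF0 values per file)
% T: 10-by-numFiles one-hot speaker targets
% [net, tr] = train(net, X, T);
```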

5.2. Recurrent Neural Networks (RNNs)

Long short-term memory (LSTM) is one of the recurrent neural network (RNN) approaches. RNNs are an extension of FNNs obtained by adding recurrent connections to the layers: in each layer, the previous state of the model is used as an additional input. This creates a memory of all the earlier information in the hidden layers. An LSTM consists of multiple recurrent units with a mechanism to selectively forget or add previous data. The recurrent connection of the RNN expresses parameter sharing over time, making it a powerful tool for temporal data. The design of the RNN layers has an enormous impact on handling sequential data, which is why RNNs are used in state-of-the-art speaker recognition systems [27,30].
In this article, we developed our proposed system with two well-known RNN approaches: LSTM and bidirectional LSTM (BLSTM). BLSTM is a variant of LSTM that processes the input data in both directions within each layer; standard LSTM is known as unidirectional LSTM. BLSTM can produce more meaningful output than LSTM because it traverses the data using forward and backward layers in both directions [31]. Research shows that the number of layers in the network affects performance: initially, performance improves when layers are added, but as more layers are added, performance decreases, likely due to overfitting. Based on that, our two adapted RNN models use the same architecture: a deeper network with three recurrent layers stacked with dropout layers. The number of neurons is set to 100 per layer; the increased number of neurons helps analyze complex input [32]. The following subsections present the details of the two approaches.

5.2.1. Long Short-Term Memory (LSTM)

In this step, we developed an LSTM model with a sequence input layer whose number of input neurons matches our considered features: F0, ∆F0, ∆∆F0, log E, ∆E, ∆∆E, MFCCs, ∆MFCCs, and ∆∆MFCCs. Three stacked LSTM layers follow the sequence input layer, with 100 units each, and each of the three LSTM layers is followed by a dropout of 0.25. After that, a fully connected layer with as many neurons as classes is added, followed by a softmax layer to classify the output. Figure 9 illustrates the architecture of the LSTM model.
When splitting the data for any model, the ratio between the training and testing subsets matters. Based on the investigation in [33], data splitting influences performance, and a 70/30 ratio for training and testing gives the best performance compared with the other ratios; their results also show that the error of the ANN model increased when the amount of data in the training set was increased further [33]. For our model, we set the data split to 70/10/20 for training, validation, and testing. The model was trained with the proposed features using the specific parameters illustrated in Table 6 (a configuration sketch follows).
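A sketch of the Figure 9 architecture and the Table 6 training options using trainNetwork; the per-frame feature dimension and the commented training call are assumptions that depend on how the feature sequences were prepared, as shown earlier:

```matlab
% Three stacked LSTM layers (100 units each), each followed by dropout of 0.25 (Figure 9),
% trained with the options of Table 6.
numFeatures = 45;      % assumed per-frame feature dimension (depends on the chosen feature set)
numSpeakers = 138;     % number of classes (speakers)
layers = [
    sequenceInputLayer(numFeatures)
    lstmLayer(100)
    dropoutLayer(0.25)
    lstmLayer(100)
    dropoutLayer(0.25)
    lstmLayer(100, 'OutputMode', 'last')
    dropoutLayer(0.25)
    fullyConnectedLayer(numSpeakers)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', ...
    'InitialLearnRate', 0.0005, ...
    'MaxEpochs', 100, ...
    'MiniBatchSize', 8, ...
    'BatchNormalizationStatistics', 'moving', ...
    'Shuffle', 'every-epoch');
% net = trainNetwork(XTrain, YTrain, layers, options);   % train on the 70% training split
```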

5.2.2. Bidirectional Long Short-Term Memory (BLSTM)

In this step, we developed a BLSTM model, as shown in Figure 10; for the input features, we used the same features as in the LSTM model. The training information is described in Table 6.
For this last model, we ran the experiment five times; the results are reported as the average accuracy, recall, precision, and F1 score. The five runs were also used to gather information from the system for error analysis.
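The BLSTM variant differs only in the recurrent layers; a sketch, reusing numFeatures, numSpeakers, and options from the LSTM sketch above:

```matlab
% Bidirectional variant of the same architecture: each lstmLayer is replaced by a bilstmLayer.
layersBi = [
    sequenceInputLayer(numFeatures)
    bilstmLayer(100)
    dropoutLayer(0.25)
    bilstmLayer(100)
    dropoutLayer(0.25)
    bilstmLayer(100, 'OutputMode', 'last')
    dropoutLayer(0.25)
    fullyConnectedLayer(numSpeakers)
    softmaxLayer
    classificationLayer];
% netBi = trainNetwork(XTrain, YTrain, layersBi, options);
```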

6. Experiments Results and Discussion

The experimental results were obtained using several deep learning architectures to measure the accuracy of the different approaches and to conduct the required analysis and evaluations.

6.1. MLP Model Results

With the MLP architecture, the system reached 88.90% accuracy, which is a high rate given that only one feature, pitch, was used. Figure 11 presents the confusion matrix result for each speaker; for clarity, the symbol S_10M refers to speaker 10, male gender. As clearly shown in the confusion matrix, the system’s mistakes mostly involve speakers of the same gender; in other words, a given male speaker is mostly confused with another male speaker, not a female one, and a female speaker is confused only with another female speaker. For example, speaker S_10M has four misprediction errors, one for speaker S_11M and three for speaker S_5M, all male.

6.2. RNNs Models Results

6.2.1. LSTM Model Results

The LSTM model’s accuracy, recall, precision, and F1 score are shown in Table 7. The worst accuracy was obtained in run 1, while the highest/best accuracy was obtained in run 3.

6.2.2. BLSTM Model Results

The error analysis focuses on the speakers with the worst recognition rates (the most errors compared to others). The average overall results of the system’s five experiment runs are shown in Table 8.
The considered public speech dataset contains 106 male and 32 female speakers, each with 136 associated audio files; subdividing 20% of the dataset gives 28 files per speaker for testing the model. Each run’s errors were counted based on the system execution log file.

6.3. Perceptual Test Results

We performed a human perceptual test to check and validate the system’s performance and to analyze its errors regarding the most frequently mistaken test tokens. The human perceptual testers were six normal adult individuals: four men and two women.
The dataset speakers with the highest numbers of errors were speaker 14M, with an average error rate of 20.71%; speaker 24M, with 23.57%; speaker 27M, with 21.43%; speaker 45M, with 22.14%; speaker 49M, with 22.86%; speaker 62M, with 20.71%; and speaker 134F, with 22.14%. The audio files were chosen based on the statistical analysis of the speakers missed by the system.
The perceptual test results are depicted in Table 9. For clarity, we refer to each listener by number and gender; for example, 1M means participant index 1, male gender. The primary goal of this research was to design automatic speaker recognition; however, a human perceptual test was conducted to verify whether our system behaves correctly after it produced high errors for specific speakers.
After analyzing these data and comparing them with the system output, it was found that most listeners failed to predict more than 10 of the 35 samples (i.e., an error rate of more than 28.5%). These perceptual test results coincide with those of the automatic system designed in this work. Therefore, we can conclude that the system we created with the suggested features behaves similarly to the natural human auditory ability, and any errors made by the system are likely due to issues with the dataset speakers and/or recordings. In other words, the main goal of performing a human perceptual test on the most poorly recognized parts of our test subset was to obtain cues about the correctness of the system’s abnormal errors for specific speakers.

6.4. Discussion

6.4.1. Compare the Impact of the Features in Our Model

In this research, we conducted a final experiment in which we evaluated the system using different combinations of all considered features. To establish a baseline for comparison, we utilized the BLSTM model and its associated hyper-parameters. The outcomes of this experiment are presented in Table 10; the best feature combination was Experiment 6, while the worst result was obtained in Experiment 2.
As shown in Table 10, the feature groups were used separately in the first three experiments, and, as clearly shown, MFCCs provide high accuracy compared to the others. However, there is not much difference between the third and fifth experiments, where pitch information was added to the MFCCs, because the MFCCs already contain enough information.
From those experiments, we came up with the following observations:
(1) The primary outcome is that pitch should not be combined with MFCCs, because this degrades the accuracy (see Experiments 3, 4, 5, and 7).
(2) MFCCs are superior in accuracy to pitch and energy, whether combined or alone (see Experiments 1 and 2).
(3) Pitch causes MFCC accuracy degradation, whereas energy enhances MFCC accuracy (see Experiments 5 and 6).
From Table 10 and the above explanation, we can see that adding unnecessary features to the feature vector degrades the automatic system’s output because of the burden of redundant (and not useful) input data. For example, comparing Experiments 5 and 6, the accuracy dropped when pitch rather than energy information was combined with the MFCCs.

6.4.2. Comparisons with State-of-the-Art Speakers Identification System

This section compares the results obtained in this work with state-of-the-art speaker identification models. The comparison is affected by the dataset used and the model developed. We compared the proposed novel features with other features used in such systems. Table 11 briefly compares the most relevant papers representing speaker identification features.
As we can see from Table 11, our features introduce a new method that uses the first and second derivatives of the selected features. Our proposed features outperform the listed state-of-the-art results. The speaker identification system developed in [13] uses our selected dataset; more recently, the YOHO dataset has been used for speaker verification systems, which are outside the scope of this work. On the other hand, the identification systems developed using spectrograms as features showed excellent results.
Figure 12 presents a comparative evaluation against several state-of-the-art models to assess the proposed approaches’ performance. The proposed BLSTM achieved 92.70%, competitive with most state-of-the-art methods. However, modified DNN approaches, especially CNNs, provide higher accuracy than our two models.

7. Conclusions

In this article, we presented the development of a speaker identification system using deep neural networks, proposing novel features to evaluate the system. The features we considered were pitch, energy, MFCCs, and their first and second derivatives. To design our system, we considered a public English speech corpus, the YOHO dataset managed and published by LDC, with 106 male and 32 female speakers. The proposed system was developed with several deep learning approaches: MLP, LSTM, and BLSTM. In the first experiment, we created a simple MLP model with ten speakers, and the accuracy reached 88.90%. The limitations of MLP led us to consider other approaches: the speech signal is quasi-stationary and time-varying, and these characteristics require models that can handle temporal structure. The RNN approaches are known as time-series neural network models, and based on considerable research, LSTM and BLSTM are suitable models for such systems.
In the second experiment, we used the LSTM model, which provided an accuracy of 85.22%. The last experiment used BLSTM; the system reached an overall average accuracy of 91.93%. The last model was run five times to compare the effect of the speech samples selected randomly for each run. The results of all experiments were analyzed, and the errors were considered. A perceptual test was performed, and statistical analysis was used to measure the direct impact of the speakers’ audio files on the quality of the designed classification model. The BLSTM model used to evaluate the speaker identification system provided the highest accuracy, although the proposed features may require further enhancement.
Future research building on this work could use the proposed features in other system types, such as speaker verification or gender and age recognition. The proposed speaker identification system could be improved by adding extra features such as spectrograms or GTCCs. It would also be interesting to evaluate the developed model on different datasets and languages to compare the impact of language on speech processing systems. Furthermore, the research could be extended using different deep neural network approaches, such as CNNs.

Author Contributions

Conceptualization, N.M.A., A.A.A. and Y.A.A.; methodology, N.M.A. and Y.A.A.; software, N.M.A.; validation, N.M.A. and Y.A.A.; formal analysis, N.M.A., A.A.A. and Y.A.A.; investigation, N.M.A., A.A.A. and Y.A.A.; resources, N.M.A. and Y.A.A.; data curation, N.M.A.; writing—original draft preparation, A.A.A.; writing—review and editing, N.M.A., A.A.A. and Y.A.A.; visualization, A.A.A.; supervision, Y.A.A.; project administration, A.A.A.; funding acquisition, Y.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Researchers Supporting Project number (RSP-2022/322), King Saud University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the Researchers Supporting Project at King Saud University.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

References

  1. Kacur, J.; Truchly, P. Acoustic and auxiliary speech features for speaker identification system. In Proceedings of the 2015 57th International Symposium ELMAR (ELMAR), Zadar, Croatia, 28–30 September 2015; IEEE: Piscataway, NJ, USA; pp. 109–112. [Google Scholar] [CrossRef]
  2. Bharali, S.S.; Kalita, S.K. Speaker identification using vector quantization and I-vector with reference to Assamese language. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET 2017, Chennai, India, 22–24 March 2017; pp. 164–168. [Google Scholar] [CrossRef]
  3. Zeinali, H.; Sameti, H.; Burget, L. HMM-based phrase-independent i-vector extractor for text-dependent speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process 2017, 25, 1421–1435. [Google Scholar] [CrossRef]
  4. Chang, J.; Wang, D. Robust speaker recognition based on DNN/i-vectors and speech separation. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing–Proceedings, New Orleans, LA, USA, 5–9 March 2017; pp. 5415–5419. [Google Scholar] [CrossRef]
  5. YOHO Speaker Verification–Linguistic Data Consortium. Available online: https://catalog.ldc.upenn.edu/LDC94S16 (accessed on 25 June 2023).
  6. Ishac, D.; Abche, A.; Karam, E.; Nassar, G.; Callens, D. A text-dependent speaker-recognition system. In Proceedings of the 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Turin, Italy, 22–25 May 2017; IEEE: Piscataway, NJ, USA; pp. 1–6. [Google Scholar] [CrossRef]
  7. Nasr, M.A.; Abd-Elnaby, M.; El-Fishawy, A.S.; El-Rabaie, S.; El-Samie, F.E.A. Speaker identification based on normalized pitch frequency and Mel Frequency Cepstral Coefficients. Int. J. Speech Technol. 2018, 21, 941–951. [Google Scholar] [CrossRef]
  8. An, N.N.; Thanh, N.Q.; Liu, Y. Deep CNNs With Self-Attention for Speaker Identification. IEEE Access 2019, 7, 85327–85337. [Google Scholar] [CrossRef]
  9. Meftah, A.H.; Mathkour, H.; Kerrache, S.; Alotaibi, Y.A. Speaker Identification in Different Emotional States in Arabic and English. IEEE Access 2020, 8, 60070–60083. [Google Scholar] [CrossRef]
  10. Jahangir, R.; Teh, Y.W.; Memon, N.A.; Mujtaba, G.; Zareei, M.; Ishtiaq, U.; Akhtar, M.Z.; Ali, I. Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network. IEEE Access 2020, 8, 32187–32202. [Google Scholar] [CrossRef]
  11. Jakubec, M.; Lieskovska, E.; Jarina, R. Speaker Recognition with ResNet and VGG Networks. In Proceedings of the 2021 31st International Conference Radioelektronika (RADIOELEKTRONIKA), Brno, Czech Republic, 19–21 April 2021; IEEE: Piscataway, NJ, USA; pp. 1–5. [Google Scholar] [CrossRef]
  12. Singh, M.K. Robust Speaker Recognition Utilizing Lexical, MFCC Feature Extraction and Classification Technique. 2023. Available online: https://www.researchgate.net/publication/366857924_Robust_Speaker_Recognition_Utilizing_Lexical_MFCC_Feature_Extraction_and_Classification_Technique (accessed on 18 July 2023).
  13. Vandyke, D.; Wagner, M.; Goecke, R. Voice source waveforms for utterance level speaker identification using support vector machines. In Proceedings of the 2013 8th International Conference on Information Technology in Asia (CITA), Kota Samarahan, Malaysia, 1–4 July 2013; IEEE: Piscataway, NJ, USA; pp. 1–7. [Google Scholar] [CrossRef]
  14. Shah, S.H.; Saeed, M.S.; Nawaz, S.; Yousaf, M.H. Speaker Recognition in Realistic Scenario Using Multimodal Data. In Proceedings of the 3rd IEEE International Conference on Artificial Intelligence, ICAI 2023, Islamabad, Pakistan, 22–23 February 2023; pp. 209–213. [Google Scholar] [CrossRef]
  15. Hamsa, S.; Shahin, I.; Iraqi, Y.; Damiani, E.; Nassif, A.B.; Werghi, N. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG. Expert Syst. Appl. 2023, 224, 119871. [Google Scholar] [CrossRef]
  16. Zailan, M.K.N.; Ali, Y.M.; Noorsal, E.; Abdullah, M.H.; Saad, Z.; Leh, A.M. Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context/Mohamad Khairul Najmi Zailan. ESTEEM Acad. J. 2023, 19, 101–112. [Google Scholar]
  17. Kao, C.-Y.; Chueh, H.-E. Voice Response Questionnaire System for Speaker Recognition Using Biometric Authentication Interface. Intell. Autom. Soft Comput. 2022, 35, 913–924. [Google Scholar] [CrossRef]
  18. Gupte, R.; Hawa, S.; Sonkusare, R. Speech Recognition Using Cross Correlation and Feature Analysis Using Mel-Frequency Cepstral Coefficients and Pitch. In Proceedings of the 2020 IEEE International Conference for Innovation in Technology (INOCON), Bengaluru, India, 6–8 November 2020; IEEE: Piscataway, NJ, USA; pp. 1–5. [Google Scholar] [CrossRef]
  19. Safari, P.; India, M.; Hernando, J. Self Attention Networks in Speaker Recognition. Appl. Sci. 2023, 13, 6410. [Google Scholar] [CrossRef]
  20. Costantini, G.; Cesarini, V.; Brenna, E. High-Level CNN and Machine Learning Methods for Speaker Recognition. Sensors 2023, 23, 3461. [Google Scholar] [CrossRef] [PubMed]
  21. Campbell, J.P. Testing with the YOHO CD-ROM voice verification corpus. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; IEEE: Piscataway, NJ, USA, 1995; pp. 341–344. [Google Scholar] [CrossRef]
  22. Rabiner, L.R.; Schafer, R.W. Introduction to Digital Speech Processing; Now Publishers Inc.: Norwell, MA, USA, 2007; Volume 1. [Google Scholar] [CrossRef]
  23. Giannakopoulos, T. A Method for Silence Removal and Segmentation of Speech Signals, Implemented in Matlab. 2009. Available online: www.di.uoa.gr/ (accessed on 10 June 2023).
  24. Uzuner, H. Robust Text-Independent Speaker Recognition over Telecommunications Systems. 2006. Available online: https://openresearch.surrey.ac.uk/esploro/outputs/doctoral/Robust-text-independent-speaker-recognition-over-telecommunications/99514390302346 (accessed on 7 January 2023).
  25. de Cheveigné, A.; Kawahara, H. YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111, 1917–1930. [Google Scholar] [CrossRef] [PubMed]
  26. Atal, B.S. Automatic Speaker Recognition Based on Pitch Contours. J. Acoust. Soc. Am. 2005, 52, 1687. [Google Scholar] [CrossRef]
  27. Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends. Available online: https://www.researchgate.net/publication/338355547_Deep_Representation_Learning_in_Speech_Processing_Challenges_Recent_Advances_and_Future_Trends (accessed on 26 June 2023).
  28. Suksri, S.; Yingthawornsuk, T. Speech Recognition using MFCC. In Proceedings of the International Conference on Computer Graphics, Simulation and Modeling, Pattaya, Thailand, 28–29 July 2012; Volume 9. [Google Scholar] [CrossRef]
  29. Alashban, A.A.; Qamhan, M.A.; Meftah, A.H.; Alotaibi, Y.A. Spoken Language Identification System Using Convolutional Recurrent Neural Network. Appl. Sci. 2022, 12, 9181. [Google Scholar] [CrossRef]
  30. Sainath, T.N.; Pang, R.; Rybach, D.; He, Y.; Prabhavalkar, R.; Li, W.; Liang, Q.; Strohman, T.; Wu, Y.; McGraw, I.; et al. Two-Pass End-to-End Speech Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 2773–2777. [Google Scholar] [CrossRef]
  31. Ray, A.; Rajeswar, S.; Chaudhury, S. Text recognition using deep BLSTM networks. In Proceedings of the ICAPR 2015–2015 8th International Conference on Advances in Pattern Recognition, Kolkata, India, 4–7 January 2015. [Google Scholar] [CrossRef]
  32. Zhang, J.; Wang, P.; Yan, R.; Gao, R.X. Deep Learning for Improved System Remaining Life Prediction. Procedia CIRP 2018, 72, 1033–1038. [Google Scholar] [CrossRef]
  33. Nguyen, Q.H.; Ly, H.B.; Ho, L.S.; Al-Ansari, N.; Le, H.V.; Tran, V.Q.; Pham, B.Q. Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil. Math. Probl. Eng. 2021, 2021, 4832864. [Google Scholar] [CrossRef]
  34. Moumin, A.A.; Kumar, S.S. Automatic Speaker Recognition using Deep Neural Network Classifiers. In Proceedings of the 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 19–21 January 2021; IEEE: Piscataway, NJ, USA; pp. 282–286. [Google Scholar] [CrossRef]
Figure 1. YOHO dataset files (numbers from 101 up to 277 present the speaker index; below them the enroll and verify sessions are listed).
Figure 2. Overview of the speech detection algorithm.
Figure 3. Detected speech plot.
Figure 4. The feature extraction process.
Figure 5. Block diagram of the pitch detector.
Figure 6. Calculating mel frequency.
Figure 7. Proposed speaker identification model.
Figure 8. MLP model architecture.
Figure 9. LSTM model architecture.
Figure 10. BLSTM model architecture.
Figure 11. MLP model results.
Figure 12. Comparisons with state-of-the-art.
Table 2. YOHO LDC speech dataset.

Gender | Speakers | Enrolment Utterances | Verification Utterances | Total for Each Speaker | Total Dataset Utterances
Male | 106 | 4 × 24 phrases | 10 × 4 | 136 utterances | 14,416
Female | 32 | 4 × 24 phrases | 10 × 4 | 136 utterances | 4352
Table 3. Results for speech detection.

 | Minimum Duration | Maximum Duration | Average Duration | Standard Deviation
Dataset before detection | 2.619875 s | 7.699875 s | 3.94195568 s | 0.47355293
Dataset after detection | 1.40775 s | 3.881375 s | 2.16832302 s | 0.25270021
Table 4. Confusion matrix.

 | Predicted Negative | Predicted Positive
Actual Negative | True Negative | False Positive
Actual Positive | False Negative | True Positive
Table 5. MLP training parameters.

Training Function | Output Layer Transfer Function | Performance Measurement | Data Normalization | Learning Rate
trainrp | Softmax | Cross-entropy | Standard | 0.0003
Table 6. LSTM training parameters.

Parameter | Value
Optimizer | Adam
Initial learn rate | 0.0005
Max epochs | 100
Mini-batch size | 8
Batch normalization statistics | Moving
Shuffle | Every epoch
Table 7. The three LSTM experiment runs’ results.

Run Number | Accuracy | Recall | Precision | F1 Score
1 | 84.34% | 85.54% | 84.34% | 84.94%
2 | 85.07% | 86.14% | 85.07% | 85.50%
3 | 85.22% | 86.55% | 85.22% | 85.88%
Average | 84.88% | 86.08% | 84.88% | 85.69%
Table 8. The five BLSTM experiments’ results.

Run Number | Accuracy | Recall | Precision | F1 Score
1 | 92.26% | 93.03% | 92.26% | 92.64%
2 | 91.28% | 92.16% | 91.28% | 91.72%
3 | 92.70% | 93.24% | 92.70% | 92.97%
4 | 92.24% | 92.67% | 92.24% | 92.45%
5 | 91.15% | 91.90% | 91.15% | 91.53%
Average | 91.93% | 92.60% | 91.93% | 92.26%
Table 9. Perceptual test results.

Listener (Number and Gender) | Speaker 14M | Speaker 24M | Speaker 27M | Speaker 134F | Speaker 45M | Speaker 49M | Speaker 62M | Total Mispredictions of the Test Samples
1M | 3 of 5 | 1 of 5 | 4 of 5 | 3 of 6 | 3 of 5 | 1 of 5 | 0 of 4 | 15 of 35
2M | 2 of 5 | 2 of 5 | 3 of 5 | 2 of 6 | 3 of 5 | 2 of 5 | 1 of 4 | 15 of 35
3M | 0 of 5 | 3 of 5 | 1 of 5 | 3 of 6 | 0 of 5 | 0 of 5 | 2 of 4 | 9 of 35
4F | 0 of 5 | 0 of 5 | 2 of 5 | 0 of 6 | 1 of 5 | 0 of 5 | 0 of 4 | 3 of 35
5M | 2 of 5 | 3 of 5 | 0 of 5 | 3 of 6 | 1 of 5 | 3 of 5 | 1 of 4 | 12 of 35
6F | 0 of 5 | 3 of 5 | 3 of 5 | 0 of 6 | 0 of 5 | 3 of 5 | 3 of 5 | 12 of 35
Table 10. Compare features using BLSTM architecture.

Experiment | Used Features | Accuracy (%)
1 | Pitch only (F0, ∆F0, and ∆∆F0) | 69.75
2 | Energy only (E, ∆E, and ∆∆E) | 64.83
3 | MFCCs only (MFCCs, ∆MFCCs, and ∆∆MFCCs) | 91.20
4 | Pitch and energy (F0, ∆F0, ∆∆F0, E, ∆E, and ∆∆E) | 78.40
5 | Pitch and MFCCs (F0, ∆F0, ∆∆F0, MFCCs, ∆MFCCs, and ∆∆MFCCs) | 91.69
6 | Energy and MFCCs (E, ∆E, ∆∆E, MFCCs, ∆MFCCs, and ∆∆MFCCs) | 95.52
7 | All features (F0, ∆F0, ∆∆F0, E, ∆E, ∆∆E, MFCCs, ∆MFCCs, and ∆∆MFCCs) | 92.70
Table 11. Comparisons with state-of-the-art.

Reference | Year | Dataset | Features | Technique | Recognition Accuracy Rate (%)
[13] | 2013 | YOHO | Pitch | SVM | 85.30
[6] | 2017 | - | Spectrogram | Correlation | 91.00
[18] | 2020 | - | MFCCs and Pitch | Correlation | 92.00
[34] | 2021 | TIMIT | MFCCT | ANN | 92.90
[11] | 2021 | VoxCeleb1 | Spectrogram | VGG, ResNet | 93.80
[12] | 2023 | - | MFCCs | CNN | 92.46
Our proposed features | 2023 | YOHO | F0, ∆F0, ∆∆F0, E, ∆E, ∆∆E, MFCCs, ∆MFCCs, and ∆∆MFCCs | BLSTM | 92.70
Our best proposed features | 2023 | YOHO | E, ∆E, ∆∆E, MFCCs, ∆MFCCs, and ∆∆MFCCs | BLSTM | 95.52
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
