1. Introduction
Over the last two decades, with the rapid development of artificial intelligence, biometrics such as voiceprint, iris, fingerprint, and face have received wide attention [1,2,3]. Speech is the most common way to communicate and convey information in people’s daily life. A person’s vocal tract structure determines that person’s unique vocal characteristics, which makes speaker recognition possible. Speaker recognition is a biometric technology that automatically identifies a speaker through the distinctive features contained in the voice. Generally speaking, speaker recognition comprises two main branches: speaker identification and speaker verification [4]. The former selects the speaker with the highest similarity by comparing the speech of the unknown speaker with the trained models; it is a multi-class classification problem. The latter determines whether the input speech belongs to a specific trained speaker model; it is a binary decision problem. Speaker identification technologies have been widely discussed in recent years.
A speaker identification system mainly consists of three parts: speech signal preprocessing, feature extraction, and classification [5]. Since human speech is influenced by the speaker’s physical condition and the external environment, extracting distinguishable information from complex speech is a challenging task. The Mel-Frequency Cepstral Coefficient (MFCC) [6,7,8,9,10], Linear Prediction Cepstrum Coefficient (LPCC) [11,12], Perceptual Linear Predictive (PLP) [13] and Linear Predictive Coding (LPC) [14] are the most frequently used traditional features in speaker recognition. Wu and Cao [6] replaced the logarithmic transformation in the standard MFCC analysis with a combined function to reduce noise sensitivity; their experiments showed that the modified MFCC feature significantly reduced the error rate in noisy environments. Sahidullah and Saha [9] proposed a novel windowing technique for computing the MFCC, based on the property of the Discrete Time Fourier Transform (DTFT) related to differentiation in the frequency domain, and it performed robustly and consistently. In Reference [10], a novel MFCC extraction algorithm for speech recognition was proposed: the filter bank was modified and used to generate power coefficients, which effectively reduced the consumption of computer hardware. Classical models such as the Gaussian Mixture Model–Support Vector Machine (GMM-SVM) [15], the GMM–Universal Background Model (GMM-UBM) [16] and Probabilistic Linear Discriminant Analysis/i-vector (PLDA/i-vector) [17] have also been applied to speaker recognition. In recent years, more and more researchers have used deep networks for speaker recognition. Shahin et al. [18] proposed a classifier called the cascaded Gaussian mixture model–deep neural network: a GMM generated emotional tags for each speaker under each emotional speaking condition, the resulting feature vector was fed into a Deep Neural Network (DNN) classifier, and the DNN output gave the final classification. The system was tested on the Emirati speech database and the “speech under simulated and actual stress” English dataset. The work in [19] used a neural network for classification and the wavelet transform to extract feature parameters; the results outperformed Multi-Layer Perceptron (MLP)-based classification in recognition accuracy, average precision, average recall and root mean square error. Matejka et al. [20] investigated combining deep bottleneck features with traditional MFCC features for speaker identification. In Reference [21], a DNN was used to extract deep features for automatic speaker and language recognition, yielding a 55% reduction in equal error rate on the 2013 Domain Adaptation Challenge out-of-domain condition and a 48% reduction on the NIST 2011 language recognition evaluation 30 s test condition.
The speaker identification methods described above have been widely accepted and applied for their respective advantages and good recognition performance, but some shortcomings remain. Traditional features only reflect the speaker’s physical information and represent shallow characteristics of the speech; they cannot fully exploit the deep structural information of speech signals [22]. A deep neural network can extract deep features of speech segments by simulating the structure of the human brain, but it ignores the most basic physical-layer characteristics. Therefore, in order to express the features of speech signals more fully and exploit the advantages of each model, several studies have proposed fusion strategies for speaker recognition in recent years. Omar et al. [23] proposed an MLP network based on feature fusion: the LPC and MFCC were fused and then input into an MLP used as the classifier of the speaker identification system. In Reference [24], the complementarity between different levels of speech signals was exploited through a fusion of deep and shallow features for speaker verification; compared with the baseline system, the EER was reduced by 54.8%. When the training utterances are short, the GMM fails to achieve good performance; to address this problem, the work in [25] used a Convolutional Neural Network (CNN) to process the spectrogram of the speech signal and combined it with a GMM through score fusion. In order to improve the robustness of speaker verification in noisy environments, Asbai and Amrouche [26] proposed a new weighted score fusion method. These studies demonstrate that fusion methods can effectively improve the performance of speaker recognition systems.
Since deep and shallow features reflect the speaker’s information from different aspects, the speaker’s characteristics can be represented more comprehensively through effective fusion. Therefore, we propose a new speaker recognition method based on the fusion of deep and shallow Gaussian supervectors. In this method, the MFCC is first extracted from the input speech signal, and a DNN is then used to obtain bottleneck features, from which the deep Gaussian supervector is acquired. On the other hand, we input the MFCC into the GMM directly to obtain the traditional Gaussian supervector. Finally, we fuse the two kinds of features into a new vector to train the SVM and complete the speaker classification.
The main contributions of this paper can be summarized as follows: (1) We design a DNN to extract deep bottleneck features that contain more discriminative information about different speakers. (2) To exploit the complementarity between features at different levels, we propose a novel fusion model that forms a new Gaussian supervector for speaker recognition. (3) We propose a speaker recognition system based on optimized weight coefficients, which improves the robustness of the system. (4) We explore the factors that affect recognition performance and use the Fisher criterion to filter out redundant information.
The remainder of the paper is organized as follows: Section 2 describes the proposed speaker identification system based on fusion features, the MFCC, the recombined Gaussian supervector and the feature selection strategy. Section 3 elaborates on the speaker identification system based on optimized weight coefficients. The experimental results and analysis are presented in Section 4. Finally, the conclusion of this work is given in Section 5.
2. Proposed Speaker Recognition System
In a speaker identification system, it is vital to extract features that indicate speaker identity information; these features are then used to train a classification model, which is finally used for identification. The performance of a speaker recognition system is therefore directly determined by the quality of its features. A single feature often cannot fully reflect the speaker’s individual information, resulting in a low recognition rate. Traditional acoustic features mostly capture the physical layer of the speech signal, reflecting shallow characteristics of human auditory perception and the vocal tract, so it is difficult for them to represent high-level information in speech segments. In recent years, DNNs have adopted multi-layer network structures to simulate the human brain, which can fully mine the deeper identity information related to the speaker. However, they do not involve the most intuitive acoustic features of the physical layer, which may also lead to poor system performance. Thus, in order to further improve the performance of the speaker identification system, we propose a novel recognition method that fuses deep features and traditional acoustic features to accurately recognize the speaker’s identity. The block diagram of the proposed model is shown in Figure 1.
In the training stage, the input speech signal is preprocessed by endpoint detection, pre-emphasis, framing, and windowing. The MFCC is then extracted from the processed signal to train the DNN. After training, we extract the deep bottleneck features. Since the GMM performs excellently in speaker recognition, we use it to further obtain the deep Gaussian supervector of the speech. On the other hand, to obtain the traditional acoustic characteristics, we input the MFCC into the GMM directly and obtain the traditional Gaussian supervector. Each Gaussian supervector reflects the mean statistical characteristics of the speech signal separately but ignores the relevance between different frames. Therefore, we recombine the extracted traditional and deep Gaussian supervectors. Finally, the deep and traditional recombined supervectors are fused by augmenting the vector dimension: the traditional Gaussian mean supervector is horizontally spliced onto the deep supervector to form a new vector with higher dimension and more individual information. The fused features are used to train the SVM classifier. In the test stage, the fusion supervector of the test speech is obtained following the same procedure as in the training phase and is then input into the trained SVM to obtain the classification result.
2.1. Recombined Gaussian Supervector
In previous work, traditional features such as the MFCC [27] and Gaussian statistical characteristics [15] were often applied to speaker identification with good performance. In this paper, we extract a 48-dimension MFCC from the input speech and calculate the Gaussian statistics as the input feature to train the SVM.
2.1.1. MFCC
If a speech frame lasts no more than 30 ms and the frame shift is 10 ms, the voice within the frame can be considered short-term stationary. Therefore, before extracting the MFCC, the speech is usually preprocessed first. The preprocessing mainly includes endpoint detection, pre-emphasis, framing, and windowing.
The pre-emphasis step can be realized by a high-pass filter, which is equivalent to
H(z) = 1 − μz^{−1},
where μ is the pre-emphasis coefficient (usually in the interval [0.9, 1]).
In the experiments, we adopt the Hamming window to smooth the edges of the framed signals and make them quasi-periodic. The window function is defined as
w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1,
where N is the frame length.
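The preprocessing steps described above can be sketched in a few lines of numpy. The function name, the 16 kHz sampling rate, and the pre-emphasis coefficient μ = 0.97 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def preprocess(signal, fs=16000, mu=0.97, frame_ms=30, shift_ms=10):
    """Pre-emphasis, framing and Hamming windowing (illustrative sketch)."""
    # Pre-emphasis high-pass filter: y[n] = x[n] - mu * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)    # 480 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)

    # Slice the signal into overlapping frames
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = emphasized[idx]

    # Hamming window: w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # one second of noise
print(frames.shape)  # (98, 480)
```

Endpoint detection is omitted here; in practice it would discard silent frames before framing.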
The MFCC represents the short-term power spectrum of human speech [15]. The Mel frequency reflects the conversion relationship between the actual frequency and the perceptual frequency. It can be obtained using the formula [28]
Mel(f) = 2595 log_{10}(1 + f/700),
where f is the actual frequency in Hz.
The specific steps and flow chart of the extraction process are shown in Figure 2. Firstly, the continuous speech signal in the time domain is transformed into a discrete digital signal by sampling, framing and windowing, and then an FFT or DFT is applied to each frame to obtain the corresponding linear spectrum. Secondly, the actual frequency is converted to the Mel scale, and the linear spectrum is passed through the Mel filter bank to obtain the Mel spectrum. Next, the logarithmic power spectrum is obtained by a logarithmic operation. Finally, the correlation between components is eliminated by the DCT, yielding the MFCC parameters. In addition, the first-order difference parameters ∆MFCC, which describe the dynamic features, are also selected as speech features.
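The pipeline above (FFT, Mel filtering, logarithm, DCT) can be sketched in numpy as follows. The FFT size, the 26 triangular filters and the 12 cepstral coefficients are illustrative assumptions for this sketch; the paper itself uses a 48-dimension MFCC.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # Mel(f) = 2595 log10(1 + f/700)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fbank

def mfcc(frames, fs=16000, n_fft=512, n_filters=26, n_ceps=12):
    """FFT -> Mel filtering -> log -> DCT, as in the flow of Figure 2."""
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, fs).T
    log_energy = np.log(mel_energy + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energy @ dct.T

ceps = mfcc(np.random.randn(98, 480))  # 98 windowed frames of 480 samples
print(ceps.shape)  # (98, 12)
```

The ∆MFCC dynamic features would be obtained by differencing `ceps` along the frame axis.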
2.1.2. Extraction of Recombined Gaussian Supervector
The GMM has been widely utilized in speaker recognition. In this paper, we mainly adopt the GMM to extract the Gaussian mean supervector. The parameters are estimated from the input data using the Expectation Maximization (EM) algorithm [29]. The probability density of an M-order GMM is given by
p(X|λ) = Σ_{k=1}^{M} w_k p_k(X),
where X is a D-dimensional random vector, p_k(X) is the density function of the k-th component in the vector space, and w_k is the mixture weight satisfying Σ_{k=1}^{M} w_k = 1. The component density is given by
p_k(X) = (1/((2π)^{D/2} |Σ_k|^{1/2})) exp{−(1/2)(X − μ_k)^T Σ_k^{−1} (X − μ_k)},
where μ_k and Σ_k refer to the mean vector and covariance matrix, respectively. Therefore, we usually use the model λ_k = {w_k, μ_k, Σ_k} to represent the k-th mixture component.
Given a feature vector set X = {x_1, x_2, …, x_T}, the aim of applying the GMM is to compute the necessary statistics. We first set the number of Gaussian components and the initial values, and then use the EM algorithm to estimate a new parameter set λ̄. The new model parameters serve as the input of the next iteration until the model converges. The re-estimation formulas are as follows:
w̄_k = (1/T) Σ_{t=1}^{T} Pr(k|x_t),
μ̄_k = Σ_{t=1}^{T} Pr(k|x_t) x_t / Σ_{t=1}^{T} Pr(k|x_t),
σ̄_k² = Σ_{t=1}^{T} Pr(k|x_t) x_t² / Σ_{t=1}^{T} Pr(k|x_t) − μ̄_k².
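As a concrete illustration of the EM re-estimation above, the following numpy sketch fits a diagonal-covariance GMM to toy data. The initialization scheme, iteration count and synthetic two-cluster data are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def em_gmm_diag(X, M, n_iter=50):
    """EM estimation of a diagonal-covariance GMM (weights w_k,
    means mu_k, variances var_k). Illustrative sketch only."""
    T, D = X.shape
    w = np.full(M, 1.0 / M)
    mu = X[np.linspace(0, T - 1, M).astype(int)].copy()  # spread initial means
    var = np.tile(X.var(axis=0), (M, 1)) + 1e-6
    for _ in range(n_iter):
        # E-step: posterior Pr(k | x_t) for every frame and component
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(axis=2)
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate w_k, mu_k and var_k from the posteriors
        nk = post.sum(axis=0) + 1e-10
        w = nk / T
        mu = (post.T @ X) / nk[:, None]
        var = (post.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

# Two well-separated 4-dim clusters around 0 and 5
X = np.vstack([np.random.default_rng(1).normal(0.0, 1.0, (200, 4)),
               np.random.default_rng(2).normal(5.0, 1.0, (200, 4))])
w, mu, var = em_gmm_diag(X, M=2)
print(round(w.sum(), 6))  # 1.0
```

The Gaussian mean supervector of the next paragraph is then simply the concatenation of the rows of `mu`.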
In this paper, we mainly use the mean vector μ_k of each Gaussian component. In order to make the input vector contain more individual information, we connect the mean vectors to form the mean supervector, which can be represented as S = [μ_1, μ_2, …, μ_M], where μ_i is the mean vector of the i-th component. If each mean vector is considered separately, the correlation between them is ignored; if the recombination number is too large, the performance of the system is reduced because the feature correlation across multiple frames decreases. Therefore, it is important to select an appropriate recombination number m for recombining the feature vectors. The first new mean vector obtained is s_1 = [μ_1, μ_2, …, μ_m]. Following this rule, the recombined supervectors are obtained by traversing the entire supervector in turn. Finally, we obtain M′ recombined supervectors. The relationship between M′ and m satisfies the following equation:
M′ = M − m + 1,
where M is the number of original Gaussian mean vectors. The new traditional recombined vector can be expressed as C = [s_1, s_2, …, s_{M′}], where s_i represents each recombined supervector.
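The recombination rule amounts to sliding a window of m consecutive mean vectors over the M original means. A minimal sketch (the values M = 8, m = 3 and the 4-dim means are illustrative):

```python
import numpy as np

def recombine(means, m):
    """Slide a window of m consecutive Gaussian mean vectors over the
    M original means, yielding M' = M - m + 1 recombined supervectors."""
    M = len(means)
    return [np.concatenate(means[i:i + m]) for i in range(M - m + 1)]

# Example: M = 8 components with 4-dim means, recombination number m = 3
means = [np.full(4, k, dtype=float) for k in range(8)]
supervectors = recombine(means, m=3)
print(len(supervectors))      # M' = 8 - 3 + 1 = 6
print(supervectors[0].shape)  # (12,)
```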
2.2. Deep Recombined Gaussian Supervector
Traditional features such as the MFCC, LPCC and Gaussian supervector simply represent the shallow physical information of the speaker’s voice and cannot capture features at a deeper level. Therefore, it is necessary to obtain a feature vector that removes redundant information and reflects the speaker’s identity more deeply. The DNN has achieved notable success in automatic speaker recognition [30,31]. There are two major applications of the DNN: one is as a classifier and the other is to extract speech features frame by frame. In our work, we design a DNN to obtain deep bottleneck features.
2.2.1. Deep Neural Network Model
The DNN is an MLP with multiple hidden layers, each of which can be pre-trained as a Restricted Boltzmann Machine (RBM) [32]. The values of the input and hidden units are generally binary and obey the Bernoulli distribution. The energy function is defined by
E(v, h|θ) = −Σ_{i} a_i v_i − Σ_{j} b_j h_j − Σ_{i} Σ_{j} v_i w_{ij} h_j,
where v and h represent the states of the visible and hidden layers, respectively, and the parameter set θ = {W, a, b} denotes the connection weights w_{ij} between visible and hidden units and the biases a and b of the visible and hidden layers.
The training of the DNN can be divided into two stages: pre-training and fine-tuning. In the pre-training stage, we adopt an unsupervised method to initialize the DNN; Contrastive Divergence (CD) [33] is used to estimate the parameters of each RBM. In the fine-tuning stage, we adopt the Back Propagation (BP) algorithm to finely adjust the network parameters. Since this process adjusts the parameters in a supervised manner, each frame of training data must be aligned with its corresponding speaker label.
2.2.2. Extraction of Deep Recombined Gaussian Supervector
In general, the DNN is composed of an input layer, hidden layers and an output layer. We mainly extract deep bottleneck features from the raw speech. Figure 3 shows the DNN structure used in this paper. We design a five-layer network to train on the speech signal, including the input layer, three hidden layers and the output layer. The structure of the network is 200-200-48-200-10, and the number of neuron nodes in the output layer equals the total number of speakers to be identified. First, MFCC features of the preprocessed training speech signals are extracted as the input of the DNN. After the pre-training and fine-tuning stages are completed, the bottleneck layer is taken as the new output. Thus, the traditional characteristic parameters are converted into deep bottleneck features.
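Once the network is trained, bottleneck extraction is just a forward pass that stops at the bottleneck layer instead of the softmax output. The sketch below uses random (untrained) weights, sigmoid activations and illustrative layer sizes [48, 200, 48, 200, 10] with a 48-dim bottleneck; these shapes are assumptions for the example, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottleneck_features(x, weights, biases, bottleneck_idx=1):
    """Forward an MLP and return the activations of the bottleneck
    hidden layer instead of the final output layer."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bottleneck_idx:
            return h  # 48-dim deep bottleneck feature
    return h

rng = np.random.default_rng(0)
dims = [48, 200, 48, 200, 10]  # input, hidden layers, output (illustrative)
weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(dims, dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

mfcc_frame = rng.standard_normal(48)  # one MFCC frame as input
feat = bottleneck_features(mfcc_frame, weights, biases, bottleneck_idx=1)
print(feat.shape)  # (48,)
```

In the actual system, the weights would come from CD pre-training followed by BP fine-tuning, and the extraction would run frame by frame over each utterance.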
After the deep bottleneck features are obtained, we use them as the input of the GMM to get the Gaussian mean vectors. Since there are correlations between the Gaussian mean vectors of different frames, we further recombine these vectors according to the rules described in Section 2.1.2. The new deep recombined Gaussian supervector is expressed as D = [d_1, d_2, …, d_{M′}], where d_i represents each recombined component.
2.3. Classification Based on Fusion Features
In order to exploit the complementarity between the deep and shallow recombined supervectors, we splice the traditional recombined supervector horizontally behind the deep recombined Gaussian supervector. From C and D, we obtain the new fusion feature as
F = [D, C].
After the fusion vectors are obtained, we input them into the SVM to reach a judgement.
2.3.1. Support Vector Machine Classifier
The target of SVM training is to find the maximum-margin hyperplane that separates the learning samples of different speakers. When the input samples are linearly separable, the SVM can be learned by solving the following optimization problem:
min_{w,b} (1/2)||w||²  subject to  y_i(w^T x_i + b) ≥ 1, i = 1, …, n,
where w represents the weight vector and b the bias. If the data set is nonlinear, we introduce a kernel function that maps the original data to a new feature space of higher dimension than the original one. Since the Radial Basis Function (RBF) has shown its advantages in pattern recognition, the RBF kernel is used in the proposed model. The RBF is
K(x, c) = exp(−||x − c||² / (2σ²)),
a radially symmetric scalar function, usually defined as a monotone function of the Euclidean distance ||x − c|| between any point x and a certain center c in the space.
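The RBF kernel can be written directly from this definition. A minimal sketch, using the common parameterization with width σ (the example points are arbitrary):

```python
import numpy as np

def rbf_kernel(x, c, sigma=1.0):
    """K(x, c) = exp(-||x - c||^2 / (2 sigma^2)): a monotonically
    decreasing function of the Euclidean distance between x and c."""
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                     # 1.0 at zero distance
print(rbf_kernel(x, np.array([3.0, 4.0])))  # smaller for distant points
```

In a kernel SVM, this function replaces every inner product between samples, so the decision boundary becomes nonlinear in the original feature space.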
In our work, speaker recognition is a multi-classification task. One-vs-one and one-vs-rest are the two main strategies for handling multi-class problems with an SVM. Since the former is much faster, we adopt the one-vs-one strategy for speaker recognition.
2.3.2. Fisher Criterion Selection
In the fusion model, the dimension of the fused feature input into the SVM classifier may be very large, which increases the modeling time. Therefore, an effective dimension-reduction strategy should be adopted to remove useless information and reduce the computational complexity of the model. We choose the Fisher criterion [34] to perform feature selection on the deep and shallow recombined Gaussian supervectors. The main idea of the Fisher criterion is that the Euclidean distance within the same category should be small, while the distance between different categories should be large. We define the q-th dimension feature of the i-th speaker as x_i^q, and the Fisher discriminant coefficient can be calculated by
J(q) = Σ_{i=1}^{C} (μ_i^q − μ^q)² / Σ_{i=1}^{C} (σ_i^q)²,
where C represents the total number of speakers, μ^q is the overall mean of the q-th dimension, and μ_i^q and (σ_i^q)² are the mean and variance of the vector x_i^q. In our proposed method, after the deep and shallow recombined supervectors are fused, we calculate the Fisher coefficients of the fusion features across the different speakers. Next, we sort them in ascending order and remove the features with smaller coefficients. Finally, the reserved features form a new vector, and irrelevant information is eliminated.
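The per-dimension Fisher ratio and the resulting feature selection can be sketched as follows. The two-speaker toy data, in which only dimension 0 is discriminative, is an illustrative assumption.

```python
import numpy as np

def fisher_scores(features, labels):
    """Per-dimension Fisher ratio: scatter of the class means around the
    overall mean, divided by the summed within-class variances."""
    classes = np.unique(labels)
    mus = np.array([features[labels == c].mean(axis=0) for c in classes])
    vars_ = np.array([features[labels == c].var(axis=0) for c in classes])
    overall = features.mean(axis=0)
    return ((mus - overall) ** 2).sum(axis=0) / (vars_.sum(axis=0) + 1e-10)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 3))
X[labels == 1, 0] += 5.0  # only dimension 0 separates the two speakers

scores = fisher_scores(X, labels)
keep = np.argsort(scores)[::-1][:2]  # keep the 2 most discriminative dims
print(scores.argmax())  # 0
```

Dimensions with small scores carry little speaker-discriminative information and are the ones removed by the selection step.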
3. Optimization of Feature Weight Coefficient
In the speaker recognition system above, when the two types of features are fused, they are directly spliced in the horizontal direction. That is, we assume by default that the two feature sets contribute equally to the system, with no further processing in subsequent steps. However, as the number of speakers increases, recognition becomes more difficult and the accuracy of the system decreases. For each speaker, different parameters contribute differently to the final recognition result, so the importance of each parameter must be considered. When several types of features are fused, weighting coefficients between them can further improve the recognition rate and reduce the probability of misjudgment. Therefore, in order to measure the weight of each feature more accurately, we use two common optimization algorithms, the Genetic Algorithm (GA) [35] and Simulated Annealing (SA) [36], to find the most appropriate weights. The system block diagram is shown in Figure 4.
When the number of speakers increases, two features alone are not enough to describe the speakers comprehensively. In order to describe the identity information more fully, the system shown in Figure 4 uses three different features to obtain the fused feature: the deep recombined Gaussian supervector, the traditional recombined Gaussian supervector and the MFCC. Assuming that the i-th feature is denoted by F_i and the corresponding weight coefficient is α_i, the fused feature with optimized weight coefficients can be expressed as
F = [α_1 F_1, α_2 F_2, …, α_N F_N],
where F represents the fused feature and N is the number of features. In this paper, the value of N is 3.
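The weighted fusion itself is a scaled horizontal splice. A minimal sketch (the vector dimensions and the weights 0.5/0.3/0.2 are illustrative; in the actual system the weights would come from the GA or SA search):

```python
import numpy as np

def weighted_fusion(features, alphas):
    """Scale each feature vector F_i by its weight alpha_i and splice
    them horizontally into one fused vector F."""
    return np.concatenate([a * f for a, f in zip(alphas, features)])

deep_sv = np.ones(6)     # deep recombined supervector (illustrative dims)
shallow_sv = np.ones(4)  # traditional recombined supervector
mfcc_vec = np.ones(3)    # MFCC-based feature

fused = weighted_fusion([deep_sv, shallow_sv, mfcc_vec], [0.5, 0.3, 0.2])
print(fused.shape)  # (13,)
```

The GA or SA would evaluate candidate weight triples by the recognition rate of the resulting SVM and keep the best-scoring one.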
In the training stage, the GA or SA algorithm is used to find the optimal weights of the three types of features; each feature is multiplied by its coefficient, and the weighted features are spliced horizontally to form a new feature. In the test stage, we obtain the fusion features of the test set by the same method as in the training phase. Once the training and test features are obtained, they are input into the SVM classifier to determine the identity of the speaker.
5. Conclusions
In this paper, we present a novel speaker recognition model based on deep and shallow recombined Gaussian supervectors, which effectively improves system performance. In the proposed approach, we first extract the MFCC from the original speech signal and input it into a DNN to extract the deep bottleneck features, from which the deep Gaussian supervector is obtained. On the other hand, we directly use the MFCC to train the Gaussian mixture model and obtain the traditional Gaussian supervector. Finally, the two supervectors are recombined and spliced horizontally to form a higher-dimensional fusion feature, which is used to train the SVM for the final judgment. In order to obtain the best performance, we adjust the network parameters and select the optimal deep features through experiments. To assess the new approach, we compare it with systems based on deep features or traditional features alone. The results show that the fusion method enhances system performance effectively. In addition, to prevent the recognition rate from falling sharply as the number of speakers to be recognized increases, we introduce optimization algorithms to find the optimal weights before feature fusion. The experimental results demonstrate that the fusion feature based on optimized weight coefficients can improve the recognition rate by 0.81%. Owing to feature fusion, the dimension of the vectors input into the SVM is enlarged, resulting in higher system complexity and longer running time, and the Fisher criterion can only remove a small part of the redundant information. Therefore, the key focus of our next research is to find a superior algorithm to further reduce the time and computational complexity of the system.