Article

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

Mengqi Niu, Liang He, Zhihua Fang, Baowei Zhao and Kai Wang
1 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
3 State Grid Xinjiang Electric Power Co., Ltd., Urumqi 830002, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(15), 7463; https://doi.org/10.3390/app12157463
Submission received: 9 July 2022 / Revised: 22 July 2022 / Accepted: 22 July 2022 / Published: 25 July 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Compared with text-independent speaker verification (TI-SV) systems, text-dependent speaker verification (TD-SV) counterparts often perform better because they efficiently exploit speech content information. On this account, some TI-SV methods, such as the c-vector, try to boost performance by incorporating an extra automatic speech recognition (ASR) component to exploit content information. However, the introduced ASR component requires a large amount of annotated data and consumes considerable computation resources. In this paper, we propose a pseudo-phoneme label (PPL) loss for the TI-SV task that integrates a content cluster loss at the frame level and a speaker recognition loss at the segment level in a unified network through multi-task learning, without additional data requirements or heavy computation. Following HuBERT, we generate pseudo-phoneme labels by deep clustering to adjust the frame-level feature distribution so that each cluster corresponds to an implicit pronunciation unit in the feature space. We compare the proposed loss with the softmax loss, center loss, triplet loss, log-likelihood-ratio cost loss, additive margin softmax loss and additive angular margin loss on the VoxCeleb database. Experimental results demonstrate the effectiveness of our proposed method.

1. Introduction

Speaker verification (SV), as a kind of biometric recognition technology, is used to determine an unknown speaker's identity from a segment of her/his utterance and plays an increasingly important role in areas such as access control, transaction authentication, monitoring, information retrieval and forensics. It is usually categorized into text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV) [1] according to whether the spoken text is restricted or not. A TD-SV system knows the content of the speech in advance, so it can use the semantic information in the speech to build a more accurate model and obtain high-performance verification results. TI-SV offers high flexibility and a wide range of applications, but its performance is not as good as TD-SV under the same speech conditions. In this paper, we focus on TI-SV.
Some research has introduced speech content information into the TI-SV system based on a multi-task learning approach. On the one hand, speaker adaptation technology is widely used in speech recognition to improve recognition accuracy, indicating that human vocal characteristics are reflected in pronunciation information [2,3]; on the other hand, the performance of TD-SV with fixed speech content is significantly better than that of TI-SV [4], indicating that changes in pronunciation content also affect the vocal characteristics. Therefore, considering both vocal and pronunciation features in speech helps to improve the effectiveness of the speaker representation.
The c-vector [5] is a frame-level feature extraction method based on multi-task learning of acoustic and pronunciation information. The pronunciation information is fused into the frame-level voiceprint features by partially sharing the hidden layers of the voiceprint feature extraction model with the automatic speech recognition (ASR) [6,7] model. However, an ASR model has to be trained in addition to the speaker recognition system, which not only requires a large amount of labeled data but also increases the amount of computation.
At present, most improvements in speaker recognition systems focus on the loss function, the pooling method and the network structure.
In early studies, a softmax output layer with the Cross-Entropy (CE) loss achieved good performance. The CE loss measures the difference between the predicted value and the real value. In comparison, the center loss [8] simultaneously learns a center for each class and penalizes the distances between features and their corresponding class centers at the frame level. Refs. [9,10,11,12,13] try to enlarge the decision boundaries between classes to better distinguish different speaker vectors. Most existing loss functions for the speaker verification task adopt a multi-class classifier at the training stage. However, most SV systems based on deep neural networks are trained without considering their main goal, which is making a binary decision.
The triplet loss [14,15] and the log-likelihood-ratio cost function (CLLR) [16], which can directly make the final decision without an extra back-end module, have been evaluated on several databases. Their performance gains come at the cost of more difficult training.
In SV, pooling is used to aggregate frame-level information. An average pooling layer is the most popular way to aggregate frame-level speaker feature vectors into a segment-level embedding for speech of variable duration. Subsequently, Refs. [17,18,19] proposed a statistics pooling layer that computes not only the mean but also the standard deviation of the frame-level features; the two vectors are concatenated to form a fixed-length representation of the input utterance. The authors of [20] proposed attentive statistics pooling, which computes attention-weighted means and standard deviations to extract speaker embeddings more accurately and effectively. The authors of [21] incorporated a structured self-attention mechanism into the pooling layer.
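As a concrete illustration, the following is a minimal sketch of statistics pooling (not taken from the paper's code; the (batch, channels, frames) tensor layout is an assumption): the mean and standard deviation over time are concatenated into a fixed-length utterance-level vector.

```python
# A minimal sketch of statistics pooling as used in the x-vector baseline.
import torch

def statistics_pooling(frames: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # frames: (batch, channels, frames)
    mean = frames.mean(dim=2)                                   # (batch, channels)
    std = frames.var(dim=2, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)                        # (batch, 2 * channels)

# e.g., 1500-dim frame features -> 3000-dim segment-level statistics.
```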
During the past few years, in addition to exploring different loss functions and pooling methods, network architecture design has also become an important part of improving SV systems. Most recent studies in this field build on the x-vector proposed by Snyder et al. [19], which replaced the i-vector [22] and became the baseline owing to its great success. A typical x-vector network consists of three main components: (1) a front-end neural network for extracting frame-level features, (2) a pooling layer that aggregates the frame-level representations across the temporal dimension to form utterance-level statistics and (3) a speaker classifier followed by a multi-class Cross-Entropy loss. The front-end neural network can be a time-delay neural network (TDNN) [23], a long short-term memory (LSTM) network [24], a convolutional neural network (CNN) [25,26], etc. State-of-the-art SV architectures such as ECAPA-TDNN [27] and RepVGG [28] are also built on the basic x-vector structure. Due to its simple structure and ease of training, our experiments are based on the basic x-vector.
In this paper, we compare the performance of various loss methods on TI-SV and propose a novel pseudo-phoneme label loss. We adopt a multi-task learning approach [29] to take both content information at the frame level and speaker information at the segment level into consideration for calculating the loss function. Unlike the c-vector [5], we just generate pseudo-phoneme labels by a simple clustering method to calculate the clustering loss. No additional annotated ASR data is required to train the ASR module. Compared with the previous methods, the main contributions of this paper are as follows:
  • We compare the performance of existing primary loss methods on speaker verification. The loss function includes softmax loss, center loss, triplet loss, log-likelihood-ratio cost loss, additive margin softmax loss and additive angular margin loss.
  • Based on HuBERT and DeepCluster, we propose a pseudo-phoneme label loss for TI-SV that introduces speech context information to boost speaker verification performance without an additional ASR component. We also tried different implementations of PPL loss and experimented with the VoxCeleb dataset.
  • We test the performance of PPL loss on the x-vector system, including the combination of PPL loss with different loss functions, and experiments show that our proposed PPL loss improves system performance.
  • Experimental results on the VoxCeleb database demonstrate the effectiveness of the proposed method. We also explored the impact of different PPL loss parameter settings on the final performance; experiments verify that, with PPL loss, a range of parameter settings improve on the baseline system, confirming the effectiveness of PPL loss.
The subsequent contents of this paper are organized as follows: Section 2 depicts several common losses in the speaker recognition tasks; Section 3 presents pseudo-phoneme label loss; Section 4 illustrates our settings; Section 5 gives experimental results and analysis. The conclusions are drawn in Section 6.

2. Related Work

This section introduces some of the loss functions and clustering methods used in the experiments.

2.1. Loss Function

A loss function measures the difference between the predicted and true values of a model. This subsection reviews previous studies on loss functions, including the cross-entropy loss, the center loss and margin-based loss functions. In addition, we also cover the triplet loss and the CLLR loss used in end-to-end systems. In the loss function formulas listed below, the common symbols are defined as follows.
The input samples are denoted as $x_i$, $i \in \{1, \dots, n\}$; the number of samples is $n$; the true label of $x_i$ is $y_i$; the number of sample classes is $C$; the weight matrix of the last fully connected layer of the model is $W$; $b$ is the bias; and the angle between the extracted speaker embedding and the corresponding weight column $W_{y_i}$ is denoted as $\theta_{y_i}$.

2.1.1. Cross-Entropy Loss

The Cross-Entropy (CE) loss function is usually used for classification tasks, and it measures the difference between the predicted values of the neural network and the true values of the samples. It is calculated as follows:
$$L_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_{j}^{T} x_i + b_j}}$$
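For illustration, the following minimal sketch (not from the paper; all tensor shapes are arbitrary) checks that the CE formula above matches PyTorch's built-in cross-entropy applied to the logits $W^{T} x_i + b$.

```python
# Minimal check that the CE formula equals PyTorch's built-in cross-entropy.
import torch
import torch.nn.functional as F

n, C, d = 4, 10, 512                      # batch size, classes, embedding dim
x = torch.randn(n, d)                     # speaker embeddings x_i
W = torch.randn(d, C)                     # last fully connected layer weights
b = torch.randn(C)                        # bias
y = torch.randint(0, C, (n,))             # true labels y_i

logits = x @ W + b                        # W_j^T x_i + b_j for all classes j
manual = -(logits[torch.arange(n), y] - torch.logsumexp(logits, dim=1)).mean()
builtin = F.cross_entropy(logits, y)      # same quantity, averaged over the batch
print(torch.allclose(manual, builtin))    # True
```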

2.1.2. Center Loss

Ref. [30] points out that, for the closed-set classification problem, the network parameters learned with the CE loss function perform well but lack good generalization for the open-set classification problem. To overcome this shortcoming of the CE loss, Wen Yandong et al. proposed the center loss [30], which constructs a center embedding vector for each class that is continuously updated during training. The distance between each speaker embedding and its corresponding center embedding is calculated and accumulated to obtain the center loss. The center loss is then multiplied by a weight and summed with the CE loss to obtain the final loss, as follows:
$$L_S = -\sum_{i=1}^{n} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_{j}^{T} x_i + b_j}}$$
$$L_C = \frac{1}{2}\sum_{i=1}^{n} \left\| x_i - c_{y_i} \right\|_2^2$$
$$L = L_S + \lambda L_C$$
Here, $c_{y_i}$ denotes the center embedding of class $y_i$, and $\lambda$ is the weight of the center loss.
Inspired by center loss, Sarthak Yadav et al. [8] then introduced the idea of center loss into the field of speaker recognition. Their experiments demonstrate that center loss can achieve good results in the field of speaker recognition.
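As an illustration, here is a minimal sketch of the center loss term $L_C$ (not the authors' code; the mean reduction over the batch and the random center initialization are implementation assumptions).

```python
# Minimal sketch of the center loss: each class keeps a learnable center c_y,
# and embeddings are pulled toward their class center.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # one learnable center embedding per class, updated by back-propagation
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim), labels: (batch,)
        diff = x - self.centers[labels]             # x_i - c_{y_i}
        return 0.5 * diff.pow(2).sum(dim=1).mean()  # batch mean (the paper's Eq. uses a sum)

# Combined objective: L = L_S + lambda * L_C, with lambda a small weight.
```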

2.1.3. Margin Loss

In general, the goal of the CE loss function is to distinguish between different classes. This leads to a problem: two embeddings from different classes near a decision boundary may be more similar than two embeddings of the same class lying in different directions. To solve this problem, it is necessary to maximize the inter-class distance and minimize the intra-class distance at the same time. A series of loss functions has therefore been derived to expand the inter-class margin for better distinguishing different classes. The currently popular margin losses are the AM-Softmax loss and the AAM-Softmax loss.
AM-Softmax: Wang Feng et al. [31] proposed the AM-Softmax loss function based on the A-Softmax [10] loss function. The specific modification is to normalize $x_i$ so that the extracted embedding is mapped onto a hypersphere. In addition, A-Softmax has to compute the angle $\theta$ with the inverse cosine function, whereas AM-Softmax only needs to subtract the margin directly from the cosine value, which significantly reduces the computation cost. Finally, the exponent is multiplied by a scale factor $s$ to sharpen the corresponding class probability in the softmax. By controlling both the hypersphere and the class margin, AM-Softmax [9] achieves excellent results. The specific formula is as follows:
$$L_{\mathrm{AM}} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{s\,(\cos\theta_{y_i} - m)}}{e^{s\,(\cos\theta_{y_i} - m)} + \sum_{j=1, j\neq y_i}^{C} e^{s\,\cos\theta_j}}$$
The AM-Softmax loss function has become one of the most popular loss functions in the field of face recognition and speaker recognition and has been used in recent papers in the field of speaker recognition (e.g., [11,32,33]), and all of them have demonstrated superior performance.
AAM-Softmax: Inspired by several previous margin losses, Deng Jiankang et al. [34] proposed the Additive Angular Margin (AAM) loss, which normalizes the feature vectors and weights and adds an angular margin $m$ to $\theta$. The angular margin has a more direct effect on the angle than the cosine margin; geometrically, it corresponds to a constant linear angular margin. Unlike the AM-Softmax loss, which maximizes the classification margin in cosine space, AAM-Softmax maximizes the classification margin directly in angle space. Its formula is as follows:
$$L_{\mathrm{AAM}} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{s\,\cos(\theta_{y_i} + m)}}{e^{s\,\cos(\theta_{y_i} + m)} + \sum_{j=1, j\neq y_i}^{C} e^{s\,\cos\theta_j}}$$
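The following minimal sketch illustrates how the AM-Softmax logits can be formed (not the authors' implementation; the feature dimension, scale s and margin m are assumed values), with a comment showing the AAM-Softmax variant that adds the margin to the angle instead.

```python
# Minimal sketch of AM-Softmax logits: cosine margin on the true class only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    def __init__(self, feat_dim=512, num_classes=1211, s=30.0, m=0.2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, num_classes))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # cosine similarity between L2-normalized embeddings and class weights
        cos = F.normalize(x, dim=1) @ F.normalize(self.W, dim=0)   # (batch, classes)
        one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * one_hot)                 # margin only on the true class
        return F.cross_entropy(logits, labels)

# AAM-Softmax variant: theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7));
# logits = s * torch.cos(theta + m * one_hot)
```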

2.1.4. Triplet Loss

Triplet loss [35] is often used in face recognition tasks. Its advantage lies in distinguishing fine details: when two inputs are similar, triplet loss can better model their subtle differences, which is equivalent to adding a measure of the difference between the two inputs and learning a better representation of the input.
Some scholars have also introduced triplet loss into speaker recognition tasks. Ref. [36] proposed TristouNet, an LSTM-based neural network designed to project speech sequences into a fixed-dimensional Euclidean space; triplet loss is used in training so that embedding sequences can be compared directly using Euclidean distance. Ref. [37] proposed a neural network-based speaker discrimination training method with triplet loss, in which the Euclidean distance similarity measure is used in both network training and SV testing, ensuring that the SV system is trained in an end-to-end manner and achieving significant performance improvements.

2.1.5. CLLR Loss

The above improvements to the loss function are all designed to optimize the network parameters aiming at a working point on the Detection Error Trade-off (DET) curve. Victoria Mingote et al. [16] proposed a loss function based on the log-likelihood ratio cost, called CLLR loss, to optimize the whole curve. This loss function is designed to minimize the approximation of false alarm and missed alarm rates during network training. The specific formula is as follows:
$$C_{\mathrm{tar}}(\theta) = \sum_{y_i \in y_{\mathrm{tar}}} \log\left(1 + e^{-s_\theta(x_i, y_i)}\right)$$
$$C_{\mathrm{non}}(\theta) = \sum_{y_i \in y_{\mathrm{non}}} \log\left(1 + e^{\,s_\theta(x_i, y_i)}\right)$$
$$C_{\mathrm{LLR}}(\theta) = \frac{1}{2\log 2}\left(\frac{C_{\mathrm{tar}}(\theta)}{N_{\mathrm{tar}}} + \frac{C_{\mathrm{non}}(\theta)}{N_{\mathrm{non}}}\right)$$
Here, $s_\theta(x_i, y_i)$ denotes the score produced by the network with parameters $\theta$ for trial $(x_i, y_i)$, $y_{\mathrm{tar}}$ and $y_{\mathrm{non}}$ are the sets of target and non-target trials and $N_{\mathrm{tar}}$ and $N_{\mathrm{non}}$ are their numbers. The CLLR loss allows the system to learn to reduce the expected log costs of target and non-target trials. By designing metric functions that can be used directly as training objectives, the training criterion becomes more consistent with the SV task.
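A minimal sketch of a CLLR-style training loss over a batch of trial scores is shown below (not the implementation of [16]; it assumes the network outputs log-likelihood-ratio-like scores and that a boolean mask marks target trials).

```python
# Minimal sketch of a CLLR-style loss following the equations above.
import torch
import torch.nn.functional as F

def cllr_loss(scores: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
    # softplus(-s) = log(1 + e^{-s}) for target trials, softplus(s) for non-targets
    tar = F.softplus(-scores[is_target])
    non = F.softplus(scores[~is_target])
    c_tar = tar.sum() / max(tar.numel(), 1)
    c_non = non.sum() / max(non.numel(), 1)
    return (c_tar + c_non) / (2.0 * torch.log(torch.tensor(2.0)))

# Example: scores are network outputs per trial; is_target is a boolean mask.
```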
The above are the popular loss functions currently used in the field of speaker recognition and verification. In this paper, the performance of each loss function is experimentally explored.

2.2. Self-Supervised and Multi-Task Learning

Speech, as an acoustic signal, contains various kinds of information, including voice characteristics and pronunciation information. Although the two are closely related, speaker-independent acoustic models for the ASR task need to remove speaker-specific voice information, while in speaker recognition, the voice feature vector extracted from the whole speech segment should not be influenced too much by the pronunciation content. Self-supervised learning and multi-task learning allow the network to learn features that are useful for multiple tasks and improve system performance.

2.2.1. Self-Supervised Learning

HuBERT [38], a self-supervised learning (SSL) method, has achieved excellent results in several speech-related tasks, such as automatic speech recognition and speaker verification. HuBERT generates pseudo-labels by clustering speech data offline, randomly masks the input features and predicts the pseudo-labels of the masked regions to calculate the loss. By alternating clustering and prediction, the network gradually learns intrinsic features step by step.
DeepCluster [39] was originally an SSL method for the vision domain. It processes the features directly, iteratively groups them with a clustering algorithm [40] and uses the clustering results as pseudo-labels to update the network weights. Through continuous training, the network learns better representations in which features are located around their cluster centers.

2.2.2. Multi-Task Learning

Multi-task learning [29] learns multiple related tasks simultaneously by sharing parameter representations. If the shared layers are considered as a feature extractor, their optimization goal is to extract feature representations that are valid for each sub-task.
The c-vector introduces pronunciation information into the TI-SV task using a multi-task learning approach. The c-vector structure is divided into two parts: one part is the voiceprint feature vector extraction model, whose output layer nodes correspond to the speaker labels of the training set and are responsible for the speaker classification task during training; the other part, together with the shared hidden layers, forms the ASR model, whose output layer nodes correspond to the pronunciation unit labels and are responsible for the pronunciation recognition task during training.
The c-vector trains the voiceprint feature extraction model and the ASR model alternately. The network integrates the voice characteristics and pronunciation information in speech, and the features extracted from the shared hidden layers contain both kinds of information, which makes the frame-level features better reflect the pronounced content.
There are two categories of training data: speaker data and speech content data. The two corresponding cost functions are the speaker feature vector cost function $L_S$ and the ASR model cost function $L_P$; the final model parameters depend on both $L_S$ and $L_P$.
By introducing a pre-trained ASR model into the speaker feature vector extraction model, c-vector introduces pronunciation information at the frame level, which improves the performance of the text-independent speaker recognition system.

3. Proposed Methods

In this section, we propose a pseudo-phoneme label (PPL) loss for the TI-SV task. It integrates a content cluster loss at the frame level and a speaker recognition loss at the segment level into a unified network through multi-task learning, without additional data requirements or heavy computation. We also tried different PPL loss implementation methods.

3.1. Pseudo-Phoneme Label Loss

Similar to HuBERT [38] and DeepCluster [39], we generate pseudo-phoneme labels by deep clustering: the frame-level features before the pooling layer are clustered to obtain class centers, and the class that each frame feature belongs to is regarded as its pseudo-phoneme label.
The loss calculation contains two parts: the clustering loss and the speaker loss. We randomly initialize as many mean vectors as there are clusters and calculate, frame by frame, the similarity between the frame-level features before the pooling layer and the mean vectors to obtain the similarity matrix. The similarity matrix is compared with the pseudo-phoneme labels to obtain the clustering loss. After the segment-level features are mapped to the speaker categories, we compute the CE loss using the ground-truth speaker labels. Finally, the two losses are summed as the total loss of the model. The overall architecture of the network is shown in Figure 1, and the loss calculation process is shown in Figure 2.
The cluster loss is calculated as follows. We feed the deep neural network with acoustic features and denote the output before the pooling layer as the frame-level feature matrix $X = [x_1, x_2, \dots, x_l]$, where $l$ is the number of feature frames and $D$ is the feature dimension of each frame. We apply an unsupervised clustering method (k-means) to the frame-level features with a preset number of categories and take the cluster that each frame belongs to as its pseudo-label $L_{pseudo}$. Each mean vector is taken as a cluster center, where $\mu = [\mu_1, \mu_2, \dots, \mu_c]$, $c$ is the number of class centers and each center has dimension $D$. Subsequently, we calculate the similarity between the frame-level features and the class centers to obtain the similarity matrix $M$ of dimension $l \times c$, which is used together with the pseudo-labels $L_{pseudo}$ to compute $L_{PPL}$.
To obtain the final loss, the classification loss and the PPL loss are combined with a weighting factor as follows:
$$M = \frac{X^{T}\mu}{\left\| X \right\| \left\| \mu \right\|}$$
$$L_{\mathrm{pseudo}} = \mathrm{labels}(\mathrm{cluster}(X))$$
$$L_{\mathrm{PPL}} = L_{\mathrm{CE}}(M_{ij}, L_i)$$
$$L = L_{\mathrm{CE}} + \lambda L_{\mathrm{PPL}}$$
where $M_{ij}$ denotes the similarity of the $i$-th frame feature to the $j$-th class center, $L_i$ denotes the pseudo-label of the $i$-th frame feature, cluster() indicates the clustering process (we use the k-means method to obtain the pseudo-labels), labels() returns the corresponding label value after clustering and $\lambda$ is the scale factor.
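To make the computation concrete, the following is a minimal sketch of the PPL loss (not the authors' code; the use of scikit-learn k-means, the cosine similarity and the 1500-dimensional frame features are implementation assumptions): pseudo-phoneme labels come from k-means on the frame-level features, and a frame-level cross-entropy is computed between the similarity matrix M and those labels.

```python
# Minimal sketch of the PPL loss: k-means pseudo-phoneme labels, cosine similarity
# to learnable class centers, and a frame-level cross-entropy against the labels.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class PPLLoss(nn.Module):
    def __init__(self, feat_dim=1500, num_centers=32):
        super().__init__()
        # c randomly initialized mean vectors used as cluster/class centers
        self.mu = nn.Parameter(torch.randn(num_centers, feat_dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (l, D) frame-level features taken before the pooling layer
        with torch.no_grad():
            # offline-style clustering to obtain pseudo-phoneme labels L_pseudo
            km = KMeans(n_clusters=self.mu.size(0), n_init=10)
            pseudo = torch.as_tensor(
                km.fit_predict(frames.detach().cpu().numpy()), dtype=torch.long)
        # similarity matrix M (l x c): cosine similarity between frames and centers
        M = F.normalize(frames, dim=1) @ F.normalize(self.mu, dim=1).t()
        return F.cross_entropy(M, pseudo.to(frames.device))

# Total loss: L = L_CE(speaker) + lambda * L_PPL, with lambda = 0.1 and 32 centers
# as in the paper's best setting.
```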

3.2. Different Implementation Methods

In addition, we compare the effect of different implementation methods of PPL loss on the final system performance.
  • Version 1: We use the multi-task learning method to jointly train the network. On the one hand, we use the deep cluster method to generate pseudo-phoneme labels for frame-level features and then calculate the PPL loss. On the other hand, we use the speaker labels to calculate the classification loss. See more details in Section 3.1.
  • Version 2: We directly calculate the PPL loss in the learnable dictionary encoding (LDE) [41,42] pooling layer, train the dictionary components and assign weights through pseudo-phoneme labels, so that the network can better map the frame-level features to the dictionary components, and obtain the phoneme-based pooling results.
  • Version 3: The PPL loss is calculated at the frame4 layer. In the PPL loss calculation, a method similar to attentive statistics pooling (ASP) [20] is used: the weight of each frame feature is computed through an attention mechanism, and the resulting weight matrix is multiplied by frame5 to obtain features based on attention and PPL loss. A sketch of this attention weighting follows this list.
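The sketch below illustrates the attention weighting described for version 3 (a small attention network is assumed; it is not the authors' exact implementation): frame4 features produce per-frame weights that scale frame5 before aggregation.

```python
# Minimal sketch of attention-derived frame weights applied to frame5 features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveWeights(nn.Module):
    def __init__(self, in_dim=512, hidden=128):
        super().__init__()
        self.att = nn.Sequential(nn.Conv1d(in_dim, hidden, 1), nn.Tanh(),
                                 nn.Conv1d(hidden, 1, 1))

    def forward(self, frame4: torch.Tensor, frame5: torch.Tensor):
        # frame4: (batch, 512, T), frame5: (batch, 1500, T)
        w = F.softmax(self.att(frame4), dim=2)   # (batch, 1, T) attention weights
        weighted = frame5 * w                    # attention-scaled frame features
        mean = weighted.sum(dim=2)               # weighted mean over time
        return weighted, mean
```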

4. Experimental Setup

Current deep neural network-based speaker recognition models are mainly x-vector structures [19], which learn speaker embeddings from a speaker-discriminative network. Several convolutional structures such as TDNN, RepVGG [28] and ResNet are commonly used in this approach. Typically, the network is trained to classify the speakers in the training set; an utterance-level speaker embedding is then obtained by aggregating frame-level speaker features through the pooling layer of the network.

4.1. Dataset

For SV, we used the VoxCeleb dataset [43,44], which consists of large-scale celebrity interviews and shows and comprises VoxCeleb1 and VoxCeleb2. VoxCeleb1 contains over 100,000 utterances from 1251 celebrities, while VoxCeleb2 contains over 1 million utterances from over 6000 celebrities, all extracted from videos uploaded to YouTube. The speech data were collected in completely realistic environments containing a variety of noisy scenes and can be used for speaker identification and verification. VoxCeleb1 is split into development (dev) and test sets with no overlapping speakers between them.
Our network is trained using only the VoxCeleb1 dev set, which contains 1211 speakers and 297,286 utterances in total. The performance of all systems is evaluated on the VoxCeleb1 test set, which contains 4874 utterances from 40 speakers. All speakers whose names start with “E” are reserved for testing, since this gives a good balance of male and female speakers. All test lists can be found on the VoxCeleb website (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/, accessed on 15 July 2022).

4.2. System Setup

In recent speaker recognition challenges, e.g., NIST SRE [45] and VoxSRC [46], the x-vector (TDNN) [19] is used as a baseline by the organizers for its good performance.
To verify the effect of our proposed PPL loss, we took the classical TDNN structure with statistics pooling (SP) [18] as the baseline and carried out experiments on the VoxCeleb database. The classical TDNN structure consists of five one-dimensional convolutional layers, a pooling layer, two fully connected layers and a softmax layer. The output of the first fully connected layer after pooling (segment6) is used as the speaker embedding for verification. Specifically, the kernel sizes of the five one-dimensional convolutional layers at the frame level are 5, 3, 3, 1 and 1, respectively.
The output of layer frame5 is a 1500-dimensional vector, so if the input speech has T frames, we obtain a (T × 1500) matrix. The mean and standard deviation of these T 1500-dimensional vectors are calculated and concatenated into a 3000-dimensional vector. Two 512-dimensional fully connected layers follow the pooling layer. The output of the softmax layer is a (1 × N)-dimensional vector, where N denotes the number of speakers in the training stage.
For training with PyTorch, we cut the input speech into chunks of 200 frames during network training. See Table 1 for the network parameter settings, where T represents the number of frames per chunk.
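For reference, the following is a minimal sketch of the frame-level TDNN layers in Table 1 (not the exact ASV-Subtools model; the batch normalization placement and activation choices are assumptions), implemented as one-dimensional convolutions with kernel sizes 5, 3, 3, 1 and 1 and dilations matching the listed contexts.

```python
# Minimal sketch of the frame-level TDNN layers from Table 1.
import torch
import torch.nn as nn

def tdnn_block(in_ch, out_ch, kernel, dilation=1):
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation),
                         nn.ReLU(), nn.BatchNorm1d(out_ch))

frame_layers = nn.Sequential(
    tdnn_block(26, 512, 5),               # frame1: context {t-2, ..., t+2}
    tdnn_block(512, 512, 3, dilation=2),  # frame2: context {t-2, t, t+2}
    tdnn_block(512, 512, 3, dilation=3),  # frame3: context {t-3, t, t+3}
    tdnn_block(512, 512, 1),              # frame4
    tdnn_block(512, 1500, 1),             # frame5
)

x = torch.randn(8, 26, 200)               # a batch of 200-frame MFCC+pitch chunks
frames = frame_layers(x)                  # (8, 1500, T'); statistics pooling follows
```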

4.3. Implementation Details

The input features of the network are 23-dimensional Mel-frequency cepstral coefficients (MFCCs) with appended pitch features, extracted from 25 ms frames with a 10 ms frame shift and mean-normalized over a sliding window of up to 3 s. Energy-based voice activity detection (VAD) is applied to filter out non-speech frames.
The MUSAN [47] and RIR [48] corpora are used as noise sources to augment the training data, doubling the amount of training data. The augmentation follows the same recipe as described in the Kaldi toolkit [49].
We used the ASV-Subtools toolkit [50] for our experiments. All models are trained with the Adam optimizer and a cosine learning rate decay scheduler, with the learning rate set to 0.001; probabilistic linear discriminant analysis (PLDA) [51] is used for back-end scoring without score normalization.
To prevent overfitting, we applied a weight decay of $1 \times 10^{-3}$ on all weights in the model. The mini-batch size for training is 64.
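A minimal sketch of the described training setup is given below (the scheduler period T_max and the placeholder model are assumptions; the actual training loop body is omitted).

```python
# Minimal sketch of the optimizer setup: Adam, lr=0.001, weight decay 1e-3,
# and cosine learning rate decay.
import torch

model = torch.nn.Linear(512, 1211)   # placeholder for the TDNN speaker network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one pass over 64-utterance mini-batches of 200-frame chunks ...
    optimizer.step()
    scheduler.step()
```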

5. Experimental Results

5.1. Experiments under Different Loss Functions

We compared the performance of several loss functions, taking the equal error rate (EER) and the minimum detection cost function (minDCF) as the performance evaluation indices.
The experimental results are shown in Table 2.
From Table 2 and Figure 3, we can see that the CLLR loss obtains the best EER, probably because the CLLR loss is more in line with the TI-SV task. Compared with the softmax loss and the triplet loss, the center loss shows its superiority, potentially because it allows speaker embeddings to be clustered into more compact representations. The AM-Softmax loss outperforms the center loss on EER and achieves the best minDCF, which demonstrates the effectiveness of the margin penalty. However, the results of the AAM-Softmax loss show that the angular margin penalty cannot be too large. The CLLR results also show that SV systems trained with one of the final verification metrics produce good scores and achieve promising results. In short, comparing the performance of different losses on the SV task shows that better performance can be obtained by using a loss function that is more consistent with the final verification metric or by training more discriminative speaker embeddings.

5.2. Results of Different Implementation Methods

The experimental settings of the three versions of PPL loss are as follows: versions 1 and 3 have the same network structure and settings, both based on the TDNN and SP pooling structures; version 2 uses TDNN with LDE pooling and calculates the PPL loss in the pooling layer. The baseline is the corresponding system without PPL loss. The results in Table 3 show that adding PPL loss to the original system integrates phonetic information into the speaker verification system and thereby improves performance. Among the three implementations, version 1 achieves the best minDCF of 0.3234, and version 3 achieves the best EER of 3.244%. We conjecture that this is because both version 1 and version 3 calculate the PPL loss on the features before pooling, which contain more speech information. Although version 2 can also use phonetic information, its effect is limited because the PPL loss is calculated at the same time as pooling, so the frame-level aggregation and classification processes interfere with each other. As version 1 has the best minDCF, we chose the version 1 PPL loss for our subsequent experiments.

5.3. Experiments on Adding PPL Loss to TDNN

PPL loss is a supplement to the classification loss. To test the performance of our proposed PPL loss, we conducted experiments with the loss functions commonly used in speaker recognition. We tested the impact of adding PPL loss to the softmax loss, AM-Softmax loss and AAM-Softmax loss, with the number of PPL clusters C set to 32 and λ set to 0.1. The results are shown in Table 4, and Figure 4 illustrates the DET curves for these conditions.
From the results, we conclude that adding PPL loss to the classification loss improves system performance, and PPL loss is applicable to the softmax, AM-Softmax and AAM-Softmax losses. Therefore, adding PPL loss can guide network training to refine frame-level features and obtain speaker features containing speech information. Moreover, PPL loss can be implemented with only simple clustering pseudo-labels and does not need any other data labels to train an ASR network.

5.4. Effectiveness of Hyperparameters

In this part, we compare the effects of different λ values on recognition performance when using the PPL loss; the case λ = 0 is equivalent to using only the AM-Softmax loss. The results are shown in Table 5.
When λ is 0, the system is equivalent to using only the AM-Softmax loss. From the table, adding the PPL loss obtains better performance than using only the AM-Softmax loss, and the best performance is obtained when λ is 0.1, indicating that the amount of speech information introduced into the speaker recognition system needs to be moderate: too little speech information has no effect, while too much interferes with the extraction of speaker information.
We also explored the effect of the number of class centers C when adding the PPL loss to the AM-Softmax loss. The case C = 0 is equivalent to using only the AM-Softmax loss, and the experimental results are shown in Table 6.
From Table 6, we can see that the EER is best when the number of pseudo-phoneme classes is 32, and the minDCF is best when the number of classes is 40. We conjecture that this is because VoxCeleb is an English dataset and English has 39 phonemes according to the CMU Pronouncing Dictionary [52], which further supports the effectiveness of clustering for capturing phoneme information.

5.5. Visualization

Before pooling, we extracted 10,000 frame-level features and each frame's pseudo-phoneme label to visualize the clustering effect with t-SNE [53]. In Figure 5, the upper plot shows the feature distribution before training with PPL loss, and the lower plot shows the distribution after training.
As can be seen from the figure, the frame-level features generated via pseudo-phoneme labels training with PPL loss are more compact than those without PPL loss because optimizing the network with PPL loss allows the frame-level features to be assigned to the corresponding cluster centers, obtaining richer voice information. Finally, a speaker representation based on phoneme information is obtained.
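A minimal sketch of such a visualization is shown below (the random feature array and label values are placeholders; the real inputs would be the extracted frame-level features and their k-means pseudo-phoneme labels).

```python
# Minimal sketch of the t-SNE visualization of frame-level features
# colored by pseudo-phoneme label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

frames = np.random.randn(10000, 1500)        # frame-level features before pooling
labels = np.random.randint(0, 32, 10000)     # pseudo-phoneme labels from k-means

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(frames)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab20")
plt.title("Frame-level features colored by pseudo-phoneme label")
plt.show()
```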

5.6. Analyses

Experimental validation shows that phoneme-like features can be learned using PPL loss, which is achieved by jointly computing the deep clustering loss and the speaker recognition loss. The network uses the pseudo-phoneme labels to assign the input feature frames to the corresponding class centers, while at the segment level the CE loss makes the learned embeddings more discriminative.
In contrast to the c-vector, our structure needs only speaker labels and no additional annotated data to train an ASR model. The pseudo-phoneme labels are obtained by clustering frame-level features, which greatly reduces the annotation requirements. At the same time, because the two loss calculations share the front-end frame-level feature extraction layers, only the deep clustering loss adds a clustering step, which does not increase the complexity of the network. Additionally, because the two losses are computed in parallel, they do not affect the inference speed of the network either.

6. Conclusions

In this paper, we propose a pseudo-phoneme label (PPL) loss for text-independent speaker verification and compare several dominant losses on the VoxCeleb database. By comparing different losses, we conclude that variants of the margin loss (AM, AAM) on the hypersphere and the log-likelihood-ratio cost (CLLR) loss perform better than the classic Cross-Entropy loss on the text-independent speaker verification task. Adding the PPL loss can boost the performance of most existing loss functions by introducing content information through unsupervised deep clustering at a low computation cost. PPL loss is applicable to the softmax, AM-Softmax and AAM-Softmax losses; it guides network training to refine frame-level features and obtain speaker features containing speech information. The method does not require training an additional ASR model and only requires simple clustering pseudo-labels to calculate the frame-level loss, so phoneme information can be introduced into the speaker representation while greatly simplifying network computation and model complexity. Experimental results on the VoxCeleb database demonstrate the effectiveness of our proposed method.

Author Contributions

Writing—original draft, M.N.; Writing–review & editing, L.H., Z.F., B.Z. and K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40. [Google Scholar] [CrossRef] [Green Version]
  2. Gales, M.J. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 1998, 12, 75–98. [Google Scholar] [CrossRef] [Green Version]
  3. Campbell, W.; Sturim, D.; Reynolds, D. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 2006, 13, 308–311. [Google Scholar] [CrossRef]
  4. Larcher, A.; Lee, K.A.; Ma, B.; Li, H. Text-dependent speaker verification: Classifiers, databases and RSR2015. Speech Commun. 2014, 60, 56–77. [Google Scholar] [CrossRef] [Green Version]
  5. Liu, Y.; He, L.; Liu, J.; Johnson, M.T. Introducing phonetic information to speaker embedding for speaker verification. EURASIP J. Audio Speech Music. Process. 2019, 2019, 1–17. [Google Scholar] [CrossRef] [Green Version]
  6. Tejedor-García, C.; Cardeñoso-Payo, V.; Escudero-Mancebo, D. Automatic Speech Recognition (ASR) Systems Applied to Pronunciation Assessment of L2 Spanish for Japanese Speakers. Appl. Sci. 2021, 11, 6695. [Google Scholar] [CrossRef]
  7. Tong, F.; Li, T.; Liao, D.; Xia, S.; Li, S.; Hong, Q.; Li, L. The XMUSPEECH System for Accented English Automatic Speech Recognition. Appl. Sci. 2022, 12, 1478. [Google Scholar] [CrossRef]
  8. Yadav, S.; Rai, A. Learning Discriminative Features for Speaker Identification and Verification. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2237–2241. [Google Scholar] [CrossRef] [Green Version]
  9. Liu, Y.; He, L.; Liu, J. Large Margin Softmax Loss for Speaker Verification. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2873–2877. [Google Scholar] [CrossRef] [Green Version]
  10. Li, Y.; Gao, F.; Ou, Z.; Sun, J. Angular Softmax Loss for End-to-end Speaker Verification. In Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 26–29 November 2018; pp. 190–194. [Google Scholar] [CrossRef] [Green Version]
  11. Chagas Nunes, J.A.; Macêdo, D.; Zanchettin, C. Additive Margin SincNet for Speaker Recognition. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–5. [Google Scholar] [CrossRef] [Green Version]
  12. Wei, Y.; Du, J.; Liu, H. Angular Margin Centroid Loss for Text-Independent Speaker Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3820–3824. [Google Scholar] [CrossRef]
  13. Li, L.; Nai, R.; Wang, D. Real Additive Margin Softmax for Speaker Verification. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7527–7531. [Google Scholar] [CrossRef]
  14. Zhang, C.; Koishida, K. End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1487–1491. [Google Scholar] [CrossRef] [Green Version]
  15. Novoselov, S.; Shchemelinin, V.; Shulipa, A.; Kozlov, A.; Kremnev, I. Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2242–2246. [Google Scholar] [CrossRef] [Green Version]
  16. Mingote, V.; Miguel, A.; Ortega, A.; Lleida, E. Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 2361–2365. [Google Scholar] [CrossRef]
  17. Snyder, D.; Ghahremani, P.; Povey, D.; Garcia-Romero, D.; Carmiel, Y.; Khudanpur, S. Deep neural network-based speaker embeddings for end-to-end speaker verification. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, USA, 13–16 December 2016; pp. 165–170. [Google Scholar] [CrossRef]
  18. Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003. [Google Scholar] [CrossRef] [Green Version]
  19. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
  20. Okabe, K.; Koshinaka, T.; Shinoda, K. Attentive Statistics Pooling for Deep Speaker Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2252–2256. [Google Scholar] [CrossRef] [Green Version]
  21. Zhu, Y.; Ko, T.; Snyder, D.; Mak, B.; Povey, D. Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3573–3577. [Google Scholar] [CrossRef] [Green Version]
  22. Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  23. Jiang, Y.; Song, Y.; McLoughlin, I.; Gao, Z.; Dai, L.R. An Effective Deep Embedding Learning Architecture for Speaker Verification. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 4040–4044. [Google Scholar] [CrossRef] [Green Version]
  24. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized End-to-End Loss for Speaker Verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar] [CrossRef] [Green Version]
  25. Lin, W.; Mak, M.W. Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3211–3215. [Google Scholar] [CrossRef]
  26. Ye, F.; Yang, J. A Deep Neural Network Model for Speaker Identification. Appl. Sci. 2021, 11, 3603. [Google Scholar] [CrossRef]
  27. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. arXiv 2020, arXiv:2005.07143. [Google Scholar]
  28. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar] [CrossRef]
  29. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098. [Google Scholar]
  30. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 499–515. [Google Scholar]
  31. Wang, F.; Cheng, J.; Liu, W.; Liu, H. Additive Margin Softmax for Face Verification. IEEE Signal Process. Lett. 2018, 25, 926–930. [Google Scholar] [CrossRef] [Green Version]
  32. Jakubec, M.; Jarina, R.; Lieskovska, E.; Chmulik, M. On Deep Speaker Embeddings for Speaker Verification. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Virtual, 26–28 July 2021; pp. 162–166. [Google Scholar] [CrossRef]
  33. Lian, Y.; Pang, J. Improved Far-field Speaker Recognition Method Based Geometry Acoustic Simulation and SpecAugment. In Proceedings of the 2021 International Conference on Intelligent Computing, Automation and Applications (ICAA), Nanjing, China, 25–27 June 2021; pp. 380–387. [Google Scholar] [CrossRef]
  34. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–17 June 2019; pp. 4685–4694. [Google Scholar] [CrossRef] [Green Version]
  35. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar] [CrossRef] [Green Version]
  36. Bredin, H. TristouNet: Triplet loss for speaker turn embedding. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5430–5434. [Google Scholar] [CrossRef] [Green Version]
  37. Zhang, C.; Koishida, K.; Hansen, J.H.L. Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1633–1644. [Google Scholar] [CrossRef]
  38. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  39. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 132–149. [Google Scholar]
  40. Mridha, M.F.; Ohi, A.Q.; Monowar, M.M.; Hamid, M.A.; Islam, M.R.; Watanobe, Y. U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data. Appl. Sci. 2021, 11, 79. [Google Scholar] [CrossRef]
  41. Cai, W.; Chen, J.; Li, M. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2018), Les Sables d’Olonne, France, 26–29 June 2018; pp. 74–81. [Google Scholar] [CrossRef] [Green Version]
  42. Cai, W.; Cai, Z.; Zhang, X.; Wang, X.; Li, M. A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5189–5193. [Google Scholar] [CrossRef] [Green Version]
  43. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 2616–2620. [Google Scholar] [CrossRef] [Green Version]
  44. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1086–1090. [Google Scholar] [CrossRef] [Green Version]
  45. Sadjadi, O.; Greenberg, C.; Singer, E.; Mason, L.; Reynolds, D. NIST 2021 Speaker Recognition Evaluation Plan. 2021. Available online: https://www.nist.gov/publications/nist-2021-speaker-recognition-evaluation-plan (accessed on 15 July 2022).
  46. Brown, A.; Huh, J.; Chung, J.S.; Nagrani, A.; Zisserman, A. VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge. 2022. Available online: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2021.html (accessed on 15 July 2022).
  47. Snyder, D.; Chen, G.; Povey, D. Musan: A music, speech, and noise corpus. arXiv 2015, arXiv:1510.08484. [Google Scholar]
  48. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar] [CrossRef]
  49. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
  50. Tong, F.; Zhao, M.; Zhou, J.; Lu, H.; Li, Z.; Li, L.; Hong, Q. ASV-SUBTOOLS: Open Source Toolkit for Automatic Speaker Verification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6184–6188. [Google Scholar] [CrossRef]
  51. Prince, S.J.; Elder, J.H. Probabilistic linear discriminant analysis for inferences about identity. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
  52. The CMU Pronouncing Dictionary. Available online: http://www.speech.cs.cmu.edu/cgi-bin/cmudict (accessed on 15 July 2022).
  53. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The x-vector architecture for speaker recognition. Using joint learning, the frame-level information from the TDNN layer is compared with the center vector in the PPL layer to calculate the PPL loss, which is then summed with the CE loss to obtain the total loss.
Figure 2. The computation process of PPL loss. Original represents the D × L dimensional feature matrix output from the TDNN layer, and the PPL layer represents the D × C dimensional center vectors maintained by the learnable dictionary encoding (LDE) layer. The similarity matrix M is calculated from the TDNN-layer features and the transposed PPL-layer centers, giving an L × C matrix. In M, each row represents the similarity between one input frame feature and the C center vectors in the PPL layer, and each color represents a class. The class assigned to each frame feature by the clustering result is its pseudo-phoneme label L_pseudo. Finally, the CE loss between M and L_pseudo gives the PPL loss.
Figure 3. The detection error trade-off (DET) curve for VoxCeleb test results of the different loss functions.
Figure 4. DET curves for VoxCeleb test results when adding PPL loss to different loss functions. (a) CE & CE + PPL; (b) AM & AM + PPL; (c) AAM & AAM + PPL.
Figure 5. The total number of frames is 10,000; figure (a) shows the feature distribution before training with PPL loss, and figure (b) shows the feature distribution after training.
Table 1. The embedding TDNN architecture. x-vectors are extracted at layer segment6.

Layer              | Layer Context  | Total Context | Input × Output
frame1             | {t−2, t+2}     | 5             | 26 × 512
frame2             | {t−2, t, t+2}  | 9             | 512 × 512
frame3             | {t−3, t, t+3}  | 15            | 512 × 512
frame4             | {t}            | 15            | 512 × 512
frame5             | {t}            | 15            | 512 × 1500
statistics pooling | {0, T}         | T             | 1500 × 3000
segment6           | {0}            | T             | 3000 × 512
segment7           | {0}            | T             | 512 × 512
Table 2. Performance comparison of the TI-SV task using different loss functions on VoxCeleb.

Loss Function             | EER (%) | minDCF (0.01)
Softmax Loss              | 3.932   | 0.4103
Center Loss               | 3.606   | 0.3647
AM-Softmax Loss           | 3.473   | 0.3469
AAM-Softmax Loss          | 3.754   | 0.3657
Triplet Loss (end-to-end) | 3.876   | 0.3961
CLLR Loss (end-to-end)    | 3.282   | 0.3594
Table 3. Results of different PPL loss implementation methods on VoxCeleb (AM-Softmax + λ·PPL, λ = 0.1, C = 32).

Implementation Method | EER (%) | minDCF (0.01)
baseline (1 & 3)      | 3.473   | 0.3469
baseline (2)          | 4.364   | 0.3797
Version 1             | 3.321   | 0.3234
Version 2             | 4.168   | 0.3887
Version 3             | 3.244   | 0.3459
Table 4. Effect of adding PPL loss on the TI-SV task on VoxCeleb (λ = 0.1, C = 32).

Loss Function                 | EER (%) | minDCF (0.01)
Softmax Loss                  | 3.932   | 0.4103
Softmax Loss + λ·PPL Loss     | 3.825   | 0.3643
AM-Softmax Loss               | 3.473   | 0.3469
AM-Softmax Loss + λ·PPL Loss  | 3.321   | 0.3234
AAM-Softmax Loss              | 3.754   | 0.3657
AAM-Softmax Loss + λ·PPL Loss | 3.412   | 0.3445
Table 5. Exploration of the effects of different λ on VoxCeleb (AM-Softmax + λ·PPL).

λ (C = 32) | EER (%) | minDCF (0.01)
0.00       | 3.473   | 0.3469
0.01       | 3.367   | 0.3520
0.05       | 3.332   | 0.3242
0.10       | 3.321   | 0.3234
0.20       | 3.346   | 0.3285
Table 6. Exploration of the effect of different C on VoxCeleb (AM-Softmax + λ·PPL).

C (λ = 0.1) | EER (%) | minDCF (0.01)
0           | 3.473   | 0.3469
32          | 3.321   | 0.3234
40          | 3.372   | 0.3074
48          | 3.428   | 0.3343
56          | 3.362   | 0.3344
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

