Article

Speaker Verification Based on Channel Attention and Adaptive Joint Loss

Houbin Fan, Jun Li, Fengpei Ge and Chunyan Liang
1 School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China
2 School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 548; https://doi.org/10.3390/electronics14030548
Submission received: 3 December 2024 / Revised: 23 January 2025 / Accepted: 24 January 2025 / Published: 29 January 2025

Abstract

In deep learning-based speaker verification, the loss function plays a crucial role. Most systems rely on a single loss function or simply sum multiple losses with manually adjusted weights, which increases experimental complexity and fails to fully leverage the complementary characteristics of different losses. To address this issue, this paper proposes a speaker verification system based on channel attention and adaptive joint loss optimization. An adaptive joint loss function dynamically adjusts the loss weights, allowing the model to better learn the similarities and differences of speakers, narrowing the gap between closed-set training and open-set testing, and enhancing generalization ability. A channel attention squeeze-and-excitation module is designed to improve the network's ability to extract channel-specific features. On the AISHELL-1 dataset, the system achieved an equal error rate of 0.84% and a minimum detection cost function of 0.0528. Experimental results demonstrate a significant improvement in speaker verification performance, confirming the effectiveness of the proposed system.

1. Introduction

Speaker verification is a binary classification problem aimed at determining whether a given speech sample comes from a specific person of interest. Traditional speaker verification frameworks based on factor analysis have gradually been replaced by deep learning algorithms. In 2014, Lei et al. [1] replaced the traditional Gaussian mixture model with a deep neural network model, obtaining the d-vector by applying L2 regularization to the speaker features extracted from the final hidden layer. In 2017, Snyder et al. [2] proposed the x-vector, a feature representation extracted from time delay neural networks (TDNNs). Using a statistical pooling layer, frame-level features are aggregated into segment-level features, allowing the output layer to capture long-term characteristics. In 2020, Desplanques et al. [3] proposed the ECAPA-TDNN model by incorporating the Res2Net structure [4] and a channel attention mechanism into the TDNN architecture. They introduced the SE-Res2Block, a one-dimensional squeeze-and-excitation block; employed dilated convolutions to enhance feature representation; and used a statistical pooling layer to fuse global and local information. Model performance was improved by concatenating the outputs of the SE-Res2Blocks with those of the initial convolutional layer.
In the SE-Res2Block, after stacking multiple convolutional layers, features are fused through a combination process. However, this fusion lacks channel information from the entire network, which can lead to the loss of important feature details, affecting system performance. With ongoing research advancements, more deep residual networks and Squeeze-Excitation modules have been applied to speaker verification tasks [5,6,7]. Additionally, incorporating attention mechanisms into deep networks has become an effective method to enhance model classification performance [8,9,10]. To address these issues, we propose a Channel Attention Module Squeeze-Excitation Res2Block (CAM-SE), which introduces a channel attention mechanism into the SE-Res2Block and uses parallel global average pooling and global max pooling to capture global statistical information for each channel.
Loss functions play an important role in deep learning-based speaker verification tasks. The normalized exponential function (Softmax) is one of the most commonly used loss functions for training speaker feature neural networks, but it does not accommodate the characteristics of similarity measurement well. In 2015, Hadsell et al. [11] introduced contrastive loss, which brings the speech features of the same speaker closer together and separates those of different speakers, emphasizing similarity measurement and performing better at speaker verification tasks. Dey et al. [12] improved contrastive loss by introducing triplet loss, which uses anchor, positive, and negative samples to help the model learn speaker features with clear differences, thus improving verification performance. In 2017, Liu et al. [13] proposed large margin loss, which learns the angular differences between features, further enhancing the discriminative power of feature representations.
Subsequently, Softmax variants AMSoftmax [14] and AAMSoftmax [15] were applied to speaker verification tasks during both training and testing. These allow the learned speaker features to follow an angular distribution, which aligns with the cosine similarity scoring backend. By introducing a cosine margin, they quantitatively control the decision boundary between training speakers to minimize intra-class variance. Both variants combine classification and metric learning by introducing a margin in the Softmax loss to maximize inter-class distance and minimize intra-class distance. However, in speaker verification, closed-set training is a multi-class classification problem, and open-set testing is a binary classification problem. Therefore, the multi-class nature of these methods causes a mismatch between training and testing.
Han et al. [16] introduced the binary classification loss function SphereFace2 [17] into speaker verification, specifically performing binary classification on a hypersphere. During training, it transforms the multi-class classification problem into K binary classification problems, constructing a binary classification objective for each class. Data from the target class are treated as positive samples, while data from the other classes are treated as negative samples, allowing for pairwise comparisons in both the training and testing sets. This approach mitigates the mismatch between closed-set training and open-set testing. However, since SphereFace2 is a binary classification function, it classifies any data that do not belong to the target speaker as negative samples, leading to weaker generalization performance.
In 2023, Feng et al. [18] introduced the InfoNCE algorithm for speaker recognition, combining it with cross-entropy loss to form a joint loss function. However, this joint loss function manually adjusts the weights of different losses without fine-tuning, which may weaken learning performance due to the uncertainty of different losses. Moreover, manually setting weights for different tasks adds complexity.
This paper combines the advantages of AAMSoftmax and SphereFace2 to address the above issues. While considering the differences between speakers and the similarities between the same speaker’s samples, it also considers the model’s generalization ability. Based on the ECAPA-TDNN model, we propose a speaker verification system based on channel attention and adaptive joint loss (CAA-TDNN). The core idea is to dynamically adjust the weights based on the relative magnitudes of two loss functions, allowing for efficient resource allocation during training and hence improving overall model performance. It can determine speaker similarity through dynamic classification metrics, adapting to datasets of different scales and demonstrating high generalization capability.

2. Related Works

2.1. Speaker Verification System Framework

Speaker verification systems generally consist of signal preprocessing, feature extraction, and identity determination, with a basic framework as shown in Figure 1. The system preprocesses the speech data, including pre-emphasis, framing, and windowing. Features are then extracted using methods such as Mel-frequency cepstral coefficients (MFCCs) [19] or filter bank features (FBank) [20]. During training, the system builds and optimizes a model based on the speaker's parameter features. During verification, the processed enrollment and test speech are input to the speaker verification model for matching and scoring, and the recognition result is determined from the matching score.
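As a concrete illustration of this front end, the sketch below extracts 80-dimensional FBank features and scores a trial by cosine similarity. It is a minimal example, assuming a torchaudio-based extraction pipeline and a hypothetical embedding network `embed_net` that takes (batch, feature, time) input; the decision threshold is likewise illustrative, not a value from the paper.

```python
import torch
import torch.nn.functional as F
import torchaudio

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # 80-dimensional log Mel filter-bank features with 25 ms frames and a 10 ms
    # shift (the settings reported in Section 4.1); waveform has shape (1, time).
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=25.0,
        frame_shift=10.0,
        sample_frequency=sample_rate,
    )

def verify(embed_net, enroll_wave, test_wave, threshold=0.5):
    # Embed the enrolled and test utterances, then accept or reject the trial
    # based on the cosine similarity of the two speaker embeddings.
    e1 = embed_net(extract_fbank(enroll_wave).t().unsqueeze(0))
    e2 = embed_net(extract_fbank(test_wave).t().unsqueeze(0))
    score = F.cosine_similarity(e1, e2).item()
    return score, score >= threshold
```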

2.2. ECAPA-TDNN

ECAPA-TDNN is a mainstream model for speaker verification tasks, whose network structure is shown in Figure 2. In this model, data are passed through a statistical pooling layer, processed, and classified into speaker categories using a loss function. The model has the following key components:
  • 1-Dimensional Squeeze-and-Excitation Res2Blocks (1-D SE-Res2Blocks): The first layer reduces the feature dimension, and the second restores the feature count to the original dimension. The entire unit scales all channels through the SE block and uses skip connections.
  • Channel- and Context-Dependent Statistics Pooling (ASP): The attention mechanism on the channels allows the model to focus more on speaker-specific features. The local input is concatenated with the global unweighted mean and standard deviation in the time domain, so as to focus more on global properties.
  • Multi-Layer Feature Aggregation and Summation (MFA): The outputs from all SE-Res2Blocks are concatenated, and the output of the preceding SE-Res2Block and that of the initial convolutional layer are used as input to each block. This effectively integrates features from multiple layers.
In the SE-Res2Block, speaker features pass through a 1 × 1 convolutional layer, followed by a 3 × 3 convolutional layer. During the convolution process, the features are divided into s subsets x_1, x_2, ..., x_s, where s is the number of subsets. After the split, each subset x_i with i ∈ {2, 3, ..., s} is passed through a 3 × 3 convolutional layer K_i, while x_1 is left unchanged. Starting from i = 3, the output of K_{i−1} is added to x_i before being processed by K_i. The output is
$$ y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & i = 3, 4, \ldots, s \end{cases} \qquad (1) $$
where y_1, y_2, ..., y_s are the module's outputs, as shown in Figure 2. After concatenation, these are input to the next 1 × 1 convolutional layer, where the channel information is merged. Each output feature subset y_i benefits from residual connections, which increase the receptive field and allow more effective extraction of speaker features, and the subsets y_i are summed to produce the output feature y. Although the SE-Res2Block can achieve a larger receptive field, this direct summation of the feature subsets loses channel information after the 3 × 3 convolution, limiting the module's feature extraction ability.
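For clarity, the split-transform-merge computation of Equation (1) can be sketched as a small PyTorch module. This is an illustrative Res2Net-style 1-D block under the assumptions noted in the comments, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Res2Conv1d(nn.Module):
    """Split-transform-merge block implementing Equation (1) for 1-D features."""

    def __init__(self, channels: int, scale: int = 8, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by the number of subsets s"
        self.scale = scale
        width = channels // scale
        pad = dilation * (kernel_size - 1) // 2
        # One convolution K_i per subset x_i, i = 2, ..., s.
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, padding=pad, dilation=dilation)
            for _ in range(scale - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); split the channels into s subsets x_1 ... x_s.
        xs = torch.chunk(x, self.scale, dim=1)
        ys = [xs[0]]                       # y_1 = x_1
        y = self.convs[0](xs[1])           # y_2 = K_2(x_2)
        ys.append(y)
        for i in range(2, self.scale):     # y_i = K_i(x_i + y_{i-1}), i >= 3
            y = self.convs[i - 1](xs[i] + y)
            ys.append(y)
        # Concatenate the subsets before the following 1x1 convolution.
        return torch.cat(ys, dim=1)
```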

3. Channel Attention and Adaptive Joint Loss

3.1. CAA-TDNN

We propose a speaker verification system based on task loss uncertainty-weighted adaptive joint loss optimization, which leverages multiple loss functions. The system involves both the network architecture and the joint optimization of the loss functions, with the structure and interconnections of the network components shown in Figure 3.
The speaker verification system based on task loss uncertainty-weighted adaptive joint loss optimization performs the speaker verification task by sharing the feature extraction network parameters and using both multi-class and binary classification losses. Table 1 shows the network parameters. The extracted features pass through the TDNN blocks and then enter the CAM-SE module. At each layer, before the residual outputs are concatenated, the features are passed through the channel attention module (CAM) to preserve channel information. In the SE block, the channel and bottleneck sizes are set to 1024 and 256, respectively. A 192-dimensional speaker embedding is extracted and optimized using the adaptive joint loss function (AJ-LF).
The CAM-SE module comprises layers of dilated convolution, normalization, global average pooling, and global max pooling and incorporates a residual structure. The CAM block has two parallel pooling paths: one using global average pooling and the other global max pooling. These paths compress the input feature dimensions, and the features are then passed forward to a shared multi-layer perceptron (MLP) network. After adding the two outputs, channel attention is obtained through an activation function. Figure 4 shows the module structure.
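A minimal sketch of such a channel attention block is given below: parallel global average and max pooling over the time axis, a shared two-layer MLP, and a sigmoid gate that rescales the channels. The reduction ratio and layer sizes are illustrative assumptions rather than the exact configuration of the CAM.

```python
import torch
import torch.nn as nn

class ChannelAttention1d(nn.Module):
    """Channel attention over (batch, channels, time) features."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared MLP applied to both pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=2, keepdim=True))   # global average pooling path
        mx = self.mlp(x.amax(dim=2, keepdim=True))    # global max pooling path
        attn = self.sigmoid(avg + mx)                 # per-channel attention weights
        return x * attn                               # rescale the input channels
```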

3.2. AAMSoftmax

Used as the output layer, AAMSoftmax [15] enables the learned features to follow an angular distribution, aligning with the cosine similarity backend. It introduces a cosine margin, which quantitatively controls the decision boundaries between speakers, minimizing intra-class variance and improving discriminative performance. It encourages features of the same class to be closer together while ensuring sufficient angular separation between features of different classes, thus enhancing discriminability and improving classification accuracy. The AAMSoftmax [15] loss function can be expressed as
$$ L_{AAM} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{C} e^{s\cos\theta_j}} \qquad (2) $$
where N is the number of speech samples in the training set, C is the number of speaker classes, y_i represents the class to which sample i belongs, θ_j is the angle between sample i and the center vector of class j, s is the scaling parameter, and m is the margin parameter. During each training iteration, the loss function averages the losses over all samples and performs backpropagation to update the network parameters. We can adjust m and s such that the loss function tightens the decision boundary, significantly reducing the intra-class distance while increasing the inter-class distance.
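The following is a compact sketch of the AAMSoftmax loss of Equation (2), assuming L2-normalized embeddings and class-center weights; it is a standard reimplementation rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int, m: float = 0.2, s: float = 30.0):
        super().__init__()
        # One learnable center vector per speaker class.
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = m, s

    def forward(self, x: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # cos(theta_j) between each embedding and each class center.
        cosine = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
        theta = torch.acos(cosine)
        # Add the angular margin m only on the target-class angle, then scale by s.
        target = F.one_hot(label, cosine.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cosine)
        # Cross-entropy over the margin-adjusted logits corresponds to Equation (2).
        return F.cross_entropy(logits, label)
```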

3.3. SphereFace2

SphereFace2 [17] is a binary classification function that is primarily used in the field of face recognition. Unlike traditional multi-class classification loss functions, it transforms the multi-class problem into K binary classification problems, enabling more efficient feature learning. SphereFace2 constructs a binary classifier for each class i, where positive sample features are constructed using the speaker information from class i, while speaker information from other classes is used as negative samples. The weight of class i is W_i. The loss function can initially be expressed as
$$ L_{sf} = \log\left(1 + \exp\left(-\left(W_y^{T}x + b_y\right)\right)\right) + \sum_{i \neq y}^{K} \log\left(1 + \exp\left(W_i^{T}x + b_i\right)\right) \qquad (3) $$
where x denotes the deep feature with corresponding label y, and L_sf is the sum of the standard binary classification losses.
During training, the speaker verification process involves K binary classification problems, where the classifier can only construct one positive sample and K − 1 negative samples, resulting in a highly imbalanced dataset. To address this issue, a parameter λ ∈ (0, 1) is introduced to balance the gradients of the positive and negative samples, and a parameter r is used to scale the cosine similarity and optimize the loss. The loss function can be expressed as
$$ L_{sf} = \lambda \log\left(1 + \exp\left(-\left(r\cos\theta_{y_i} + b_y\right)\right)\right) + (1 - \lambda) \sum_{i \neq y}^{K} \log\left(1 + \exp\left(r\cos\theta_i + b_i\right)\right) \qquad (4) $$
where b is the bias. In the experiments, to simplify the parameters, b = b_i = b_y is set such that the decision boundary becomes r cos θ_i + b = 0, improving the stability of training. To increase compactness within a class, an adjustable margin parameter n is introduced. The final decision boundary can be expressed as r(cos θ − n) + b = 0, and the SphereFace2 loss function as
$$ L_{sf} = \lambda \log\left(1 + \exp\left(-\left(r\left(g(\cos\theta_{y_i}) - n\right) + b\right)\right)\right) + (1 - \lambda) \sum_{i \neq y}^{K} \log\left(1 + \exp\left(r\left(g(\cos\theta_i) + n\right) + b\right)\right) \qquad (5) $$
The function g(z) = 2((z + 1)/2)^t − 1 addresses the issue of similarity score overlap caused by the distribution inconsistency between positive and negative pairs. During training, g(cos θ) replaces the original cosine similarity.
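A sketch of this binary classification loss is shown below, implementing Equation (5) with the mapping g(z) defined above; the hyperparameters follow the values listed in Section 4.1, while the shared bias and batch averaging are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphereFace2Loss(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int,
                 lam: float = 0.7, r: float = 30.0, n: float = 0.2, t: float = 3.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.bias = nn.Parameter(torch.zeros(1))   # shared bias b = b_i = b_y
        self.lam, self.r, self.n, self.t = lam, r, n, t

    def g(self, z: torch.Tensor) -> torch.Tensor:
        # g(z) = 2((z + 1) / 2)^t - 1 reshapes the cosine similarity distribution.
        return 2.0 * ((z + 1.0) / 2.0).pow(self.t) - 1.0

    def forward(self, x: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        cosine = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1.0, 1.0)
        gz = self.g(cosine)
        target = F.one_hot(label, cosine.size(1)).bool()
        # Target classes receive margin -n, non-target classes +n.
        logits = self.r * torch.where(target, gz - self.n, gz + self.n) + self.bias
        # softplus(z) = log(1 + exp(z)): positive terms weighted by lambda,
        # negative terms by (1 - lambda), averaged over the batch.
        pos = self.lam * F.softplus(-logits[target])
        neg = (1.0 - self.lam) * F.softplus(logits[~target])
        return (pos.sum() + neg.sum()) / x.size(0)
```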

3.4. Loss Optimization

The proposed AJ-LF combines the traditional multi-class loss function AAMSoftmax with the binary classification loss function SphereFace2. While focusing on speech separability and similarity, it enables pairwise training during model training, thereby reducing the gap between closed- and open-set training. The joint loss function can be expressed as
$$ Loss = \sigma L_{AAM} + (1 - \sigma) L_{sf} \qquad (6) $$
The adaptive hyperparameter σ ∈ (0, 1) adjusts the weights of the two loss functions and is dynamically adjusted during training based on the relative magnitudes of the two losses. The loss weight is
$$ \sigma = \frac{1}{1 + \exp\left(L_{AAM} - L_{sf}\right)} \qquad (7) $$
The AJ-LF can dynamically adjust the relative importance of different losses based on their variations during training, enabling the model to more effectively learn different features in a balanced way. While focusing on the separability of different speakers, it also enhances the compactness within the same speaker class.
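As a sketch, the adaptive weighting of Equations (6) and (7) can be written in a few lines; detaching the weight from the computation graph, so that gradients flow only through the weighted losses themselves, is an implementation assumption not stated in the paper.

```python
import torch

def adaptive_joint_loss(loss_aam: torch.Tensor, loss_sf: torch.Tensor) -> torch.Tensor:
    # sigma = 1 / (1 + exp(L_AAM - L_sf)) = sigmoid(L_sf - L_AAM), Equation (7).
    sigma = torch.sigmoid((loss_sf - loss_aam).detach())
    # Loss = sigma * L_AAM + (1 - sigma) * L_sf, Equation (6).
    return sigma * loss_aam + (1.0 - sigma) * loss_sf

# Typical use inside a training step (hypothetical loss heads):
#   loss = adaptive_joint_loss(aam_head(emb, labels), sf2_head(emb, labels))
#   loss.backward()
```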

4. Experiment

4.1. Experimental Setup

The dataset used in this study is AISHELL-1 [21], which includes a training set with 120,421 speech samples from 340 speakers and a test set with 7176 speech samples from 20 speakers, totaling 143,120 test pairs. For each speaker, one anchor sample, one positive sample, and 99 negative samples are randomly selected to form a test pair, and the cosine similarity between the anchor sample and the other 100 speech samples is calculated. To test the system’s robustness and generalization ability, we also trained our system using the VoxCeleb2 [22] dataset, which includes 5994 speakers and 1,092,009 utterances in its development set. In our experiments, the evaluation tests were conducted on three versions of the VoxCeleb1 [23] test set: Vox1_O, Vox1_E, and Vox1_H. Additionally, data augmentation techniques were applied, including simulating reverberation using the RIR [24] dataset and adding noise from the MUSAN [25] dataset.
All experiments in this paper were conducted on the PyTorch platform [26]. No speech enhancement was applied at the front end of the system. In the experiments, 80-dimensional FBank features were extracted, with a frame length of 25 milliseconds and a frame shift of 10 milliseconds. The Adam [27] optimizer was used, with momentum and weight decay of 0.97 and 2 × 10^−5, respectively. The batch size was set to 128, the initial learning rate was 0.001, and the number of epochs was 100. The model was evaluated using different loss functions, as well as the AJ-LF. In AAMSoftmax, the boundary and scaling parameters were set to m = 0.2 and s = 30, respectively. In SphereFace2, we set λ = 0.7, n = 0.2, r = 30, and t = 3.
The equal error rate (EER) [28] is a key metric for evaluating speaker recognition performance, with lower values indicating better performance. It is defined as
$$ EER = FAR(th) = FRR(th) \qquad (8) $$
where th is the threshold, and FAR(th) and FRR(th) are the respective false-acceptance and false-rejection rates at threshold th; the EER is their common value at the threshold where the two rates are equal.
The minimum detection cost function (MinDCF) [29] is an evaluation method defined in the NIST Speaker Recognition Evaluation [28], with lower values indicating better performance. MinDCF is the sum of the costs associated with FAR(th) and FRR(th) when the total cost is minimized, i.e.,
$$ MinDCF = \min_{th}\left\{ C_{FR} P_{tar} R_{FR} + C_{FA} P_{non} R_{FA} \right\} \qquad (9) $$
where C_FR and C_FA are the respective false-rejection and false-acceptance costs; P_tar and P_non are the respective prior probabilities of true speaker tests and impostor tests; and R_FR and R_FA are the respective false-rejection and false-acceptance rates.
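Both metrics can be computed from a list of trial scores and labels as sketched below; the default cost and prior values are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def eer_min_dcf(scores, labels, c_fr=1.0, c_fa=1.0, p_tar=0.01):
    # scores: cosine similarities; labels: 1 for target trials, 0 for impostor trials.
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for th in thresholds:
        accept = scores >= th
        fars.append(np.mean(accept[labels == 0]))     # impostors accepted: FAR(th)
        frrs.append(np.mean(~accept[labels == 1]))    # targets rejected: FRR(th)
    fars, frrs = np.array(fars), np.array(frrs)
    # EER: the operating point where FAR(th) and FRR(th) coincide (Equation (8)).
    idx = np.argmin(np.abs(fars - frrs))
    eer = (fars[idx] + frrs[idx]) / 2.0
    # MinDCF: minimum weighted detection cost over all thresholds (Equation (9)).
    min_dcf = np.min(c_fr * p_tar * frrs + c_fa * (1.0 - p_tar) * fars)
    return eer, min_dcf
```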

4.2. Experimental Results and Analysis

4.2.1. Ablation Experiments

To investigate the effectiveness of each module in the speaker verification system, ablation experiments were conducted on the AISHELL-1 dataset. We used ECAPA-TDNN as the baseline model (with AAMSoftmax as the loss function) and performed experiments on the original network, original network with AJ-LF, original network with CAM-SE, and original network with both CAM-SE and AJ-LF.
Table 2 shows the experimental results, from which we see that adding both CAM-SE and AJ-LF effectively improves system performance. When the CAM-SE module is added to the original network, EER and MinDCF decrease by 25% and 18%, respectively. This demonstrates the effectiveness of introducing the CAM in the SE-Res2Block. The inclusion of the AJ-LF reduces the system’s EER by 18%, with a slight increase in the MinDCF value. This indicates that, after adding AJ-LF, the system can more effectively balance the learning of different features, resulting in better generalization performance.
After adding both proposed modules to the system, the performance on the dataset surpasses that of the original network and the networks with individual modules. EER and MinDCF on the AISHELL-1 dataset reached 0.84% and 0.0528, respectively, representing respective reductions of 27% and 8% compared with the original network. This demonstrates that the proposed CAM-SE and AJ-LF modules place more emphasis on channel information and can dynamically adjust the loss weights during training, resulting in better performance in speaker verification tasks.

4.2.2. Comparison with Existing Methods

To investigate the effectiveness of CAA-TDNN, the experimental results of the proposed method are compared with those of existing methods on the AISHELL-1 dataset, as shown in Table 3. It can be seen that the proposed system outperforms other systems in terms of the EER metric; specifically, EER is reduced by 18% compared with the current mainstream model ECAPA-TDNN [3]. Additionally, compared with the approach of [30], which uses a modified ResNet-50 as the backbone and multi-loss fusion for joint training, EER is reduced by 44%. Compared with the Thin ResNet-34 network used in reference [31], the performance of the proposed system is also greatly improved. The performance improvement of the proposed system is mainly attributed to the use of the CAM and adaptive joint loss optimization.
Additionally, the visualization method proposed by Maaten et al. [32] is used to reduce the dimensionality of the features extracted by the proposed CAA-TDNN system, resulting in a t-SNE visualization plot. The test set of the AISHELL-1 dataset, consisting of 7176 utterances from 20 speakers, is used for testing, with different symbols and colors representing the 20 speakers. Figure 5a shows the visualization of features extracted by ResNet34, where the features are not clearly classified, with large inter-class and intra-class distances, and many misclassifications, indicating weak classification ability. Figure 5b shows the visualization of features extracted by the Thin ResNet-34 model, with improved classification accuracy compared with Figure 5a, though the inter-class distances are still not distinct for some classes. Figure 5c shows the feature visualization from the ECAPA-TDNN model, which has larger inter-class distances, smaller intra-class distances, and a significant improvement in accuracy compared with Figure 5b. Compared to the model discussed earlier, Figure 5d shows the feature visualization from the proposed CAA-TDNN model, which has higher classification accuracy and demonstrates tighter intra-class distances and more distinct inter-class distances. This indicates that the channel attention squeeze-excitation module (CAM-SE) enhances the model’s ability to distinguish features, and the AJ-LF further improves speaker verification performance by focusing on both inter-speaker separation and intra-speaker compactness.

4.2.3. The Impact of σ on Model Performance

To verify the adjustment capability of the AJ-LF, in the experiment, the hyperparameter σ in Equation (6) was assigned values ranging from 0.1 to 0.9. Using the same network architecture and training method, nine different networks were trained and compared with the network using adaptive joint loss. Figure 6 shows the speaker verification performance with different values of σ. From the results, it is evident that the value of σ has a significant influence on the system. When σ is large, the contribution of AAMSoftmax becomes more important, and the model focuses more on the separation of different samples. When σ is small, the contribution of SphereFace2 becomes more important, and the model emphasizes intra-class compactness. When trained with the AJ-LF, the two losses can mutually exploit their correlation to adapt the shared parameters in the backbone network, thus learning more discriminative speaker features. A comprehensive analysis of both evaluation metrics shows that the system achieves the best speaker verification performance when using the AJ-LF, confirming the effectiveness of the proposed loss optimization method.

4.2.4. Performance Comparison of Different Loss Functions

To validate the effect of the proposed AJ-LF on the model, its performance was experimentally compared with that of various loss functions under different models, with the results shown in Table 4. The results indicate that applying the AJ-LF improves verification performance. In the ResNet34 model [16], compared with using the single loss function SphereFace2, EER is similar, but MinDCF is reduced by 23%. In the ECAPA-TDNN model, compared with the AAMSoftmax loss function used in the original network, EER is reduced by 18%. Therefore, the AJ-LF can balance the separation of different speakers while maintaining intra-class compactness, outperforming the single loss functions. The experimental results show that the proposed speaker verification system, based on channel attention and adaptive joint loss, has a significant advantage over other methods in the speaker verification task, in terms of both EER and MinDCF.

4.2.5. Robustness and Generalization Ability

In the experiments mentioned above, the proposed method achieved good performance on the AISHELL-1 Chinese dataset without data augmentation. However, in practical applications, speaker verification systems require substantial data support, and to demonstrate robustness, a system must also perform well in various noise environments. To verify the robustness and generalization performance of the proposed method, in this section, training is conducted on the VoxCeleb2 dataset, and testing is performed on Vox1_O, Vox1_E, and Vox1_H. Additionally, based on the MUSAN dataset (music, speech, noise) and the RIR dataset (reverberation), we incorporated reverberation and noise augmentation in the experiments. Specifically, five augmentation strategies were employed: adding reverberation, adding speech, adding music, adding noise, and adding a mixture of speech and music. Each strategy has an equal application probability (0.2) and is randomly selected and applied during training.
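The strategy selection can be sketched as follows; `add_reverb` and `add_noise` stand in for RIR convolution and MUSAN mixing routines and are hypothetical placeholders supplied by the caller, not functions from the paper or a specific library.

```python
import random

STRATEGIES = ["reverb", "speech", "music", "noise", "speech+music"]

def augment(waveform, add_reverb, add_noise):
    # Draw one of the five strategies uniformly (probability 0.2 each) and apply it.
    choice = random.choice(STRATEGIES)
    if choice == "reverb":
        return add_reverb(waveform)                       # convolve with a simulated RIR
    if choice == "speech+music":
        return add_noise(add_noise(waveform, "speech"), "music")
    return add_noise(waveform, choice)                    # mix in MUSAN speech/music/noise
```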
Table 5 shows the EER and MinDCF of the baselines and the proposed model on the Vox1_O, Vox1_E, and Vox1_H test sets. Compared with the baseline ECAPA-TDNN, the EER is relatively reduced by 14.9%, 17.8%, and 21.7%, respectively, and the MinDCF is relatively reduced by 22.4%, 21.6%, and 17.4%, respectively. Compared with other advanced networks, the proposed method also achieves good results in terms of both EER and MinDCF. The experimental results show that the proposed method performs well when trained on a large-scale dataset and tested across datasets, fully demonstrating the generalization ability of the system. At the same time, after adding noise augmentation, the robustness of the system is significantly improved, and good results are achieved on test sets under different conditions.
We also conducted tests on downsampled audio to simulate real-world scenarios with lower audio quality. We tested CAA-TDNN and the baseline model ECAPA-TDNN on the AISHELL-1 test set at two sampling rates: 8 kHz and 16 kHz. The results are shown in Table 6, from which it can be observed that CAA-TDNN outperforms the baseline model at both sampling rates.

5. Conclusions

To optimize the loss function more effectively, this paper proposes an AJ-LF that combines a binary classification loss with a traditional multi-class classification loss while dynamically adjusting the loss weights. This approach eliminates the need for manually set weights and allows the model to better capture the similarities and differences between speakers during classification. It also narrows the gap between closed-set training and open-set testing, thereby enhancing the model's generalization ability. Additionally, the CAM is introduced to improve the network's ability to extract channel-wise features, reducing information loss. Experiments and robustness tests on different datasets validate the effectiveness of the proposed modules, and comparisons with other networks demonstrate the advantages of the method. The results show that the proposed approach enhances performance across various networks and achieves better speaker verification results than a single loss function. By incorporating the channel attention module, the model's performance in speaker verification tasks is significantly improved. In future work, we plan to conduct experiments with more complex network architectures to further explore the potential of the proposed method.

Author Contributions

Conceptualization, methodology, software, validation, data curation, writing—original draft preparation, visualization, H.F.; formal analysis, investigation, resources, H.F. and J.L.; writing—review and editing, H.F., F.G., J.L., and C.L.; supervision, project administration, funding acquisition, F.G. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (12204062) and the Shandong Provincial Natural Science Foundation (ZR2022MF330).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We thank LetPub (www.letpub.com.cn) for its linguistic assistance during the preparation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lei, Y.; Scheffer, N.; Ferrer, L. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1695–1699. [Google Scholar]
  2. Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003. [Google Scholar]
  3. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
  4. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  5. Liu, B.; Chen, Z.; Wang, S.; Wang, H.; Han, B.; Qian, Y. DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design. In Proceedings of the INTERSPEECH 2022, Incheon, Korea, 18–22 September 2022; pp. 296–300. [Google Scholar]
  6. Zhou, T.; Zhao, Y.; Wu, J. Resnext and res2net structures for speaker verification. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 301–307. [Google Scholar]
  7. Zhu, W.; Kong, T.; Lu, S.; Li, J.; Zhang, D.; Deng, F.; Wang, X.; Yang, S.; Liu, J. SpeechNAS: Towards better trade-off between latency and accuracy for large-scale speaker verification. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 1102–1109. [Google Scholar]
  8. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  9. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  10. Yip, J.Q.; Truong, T.; Ng, D.; Zhang, C.; Ma, Y.; Nguyen, T.H.; Ni, C.; Zhao, S.; Chng, E.S.; Ma, B. Aca-net: Towards lightweight speaker verification using asymmetric cross attention. arXiv 2023, arXiv:2305.12121. [Google Scholar]
  11. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  12. Dey, S.; Madikeri, S.R.; Motlicek, P. End-to-end Text-dependent Speaker Verification Using Novel Distance Measures. In Proceedings of the INTERSPEECH 2018, Hyderabad, India, 2–6 September 2018; pp. 3598–3602. [Google Scholar]
  13. Liu, Y.; He, L.; Liu, J. Large margin softmax loss for speaker verification. arXiv 2019, arXiv:1904.03479. [Google Scholar]
  14. Xiang, X.; Wang, S.; Huang, H.; Qian, Y.; Yu, K. Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 1652–1656. [Google Scholar]
  15. Hajibabaei, M.; Dai, D. Unified hypersphere embedding for speaker recognition. arXiv 2018, arXiv:1807.08312. [Google Scholar]
  16. Han, B.; Chen, Z.; Qian, Y. Exploring Binary Classification Loss for Speaker Verification. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  17. Wen, Y.; Liu, W.; Weller, A.; Raj, B.; Singh, R. Sphereface2: Binary classification is all you need for deep face recognition. arXiv 2021, arXiv:2108.01513. [Google Scholar]
  18. Feng, T.; Fan, H.; Ge, F.; Cao, S.; Liang, C. Speaker Recognition Based on the Joint Loss Function. Electronics 2023, 12, 3447. [Google Scholar] [CrossRef]
  19. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  20. Wang, J.; Li, L.; Wang, D.; Zheng, T.F. Research on generalization property of time-varying Fbank-weighted MFCC for i-vector based speaker verification. In Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, Singapore, 12–14 September 2014; p. 423. [Google Scholar]
  21. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Hsinchu, Taiwan, 17–19 October 2017; pp. 1–5. [Google Scholar]
  22. Chung, J.S.; Nagrani, A.; Zisserman, A. Voxceleb2: Deep speaker recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
  23. Nagrani, A.; Chung, J.S.; Zisserman, A. Voxceleb: A large-scale speaker identification dataset. arXiv 2017, arXiv:1706.08612. [Google Scholar]
  24. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
  25. Snyder, D.; Chen, G.; Povey, D. Musan: A music, speech, and noise corpus. arXiv 2015, arXiv:1510.08484. [Google Scholar]
  26. Snyder, D.; Garcia-Romero, D.; Sell, G.; McCree, A.; Povey, D.; Khudanpur, S. Speaker recognition for multi-speaker conversations using x-vectors. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5796–5800. [Google Scholar]
  27. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Kinnunen, T.; Evans, N.; Yamagishi, J.; Lee, K.A.; Sahidullah, M.; Todisco, M.; Delgado, H. Asvspoof 2017: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. Training 2017, 10, 1508. [Google Scholar]
  29. Martin, A.; Przybocki, M. The NIST 1999 speaker recognition evaluation—An overview. Digit. Signal Process. 2000, 10, 1–18. [Google Scholar] [CrossRef]
  30. Li, X.; Hu, X.; Chen, X.; Pan, H.; Niu, K. Deep speaker embedding using hybrid network of multi-feature aggregation and multi-loss fusion for TI-SV. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 506–512. [Google Scholar]
  31. Xie, W.; Nagrani, A.; Chung, J.S.; Zisserman, A. Utterance-level aggregation for speaker recognition in the wild. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5791–5795. [Google Scholar]
  32. Van der Maaten, L.; Hinton, G. Visualizing data using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  33. Yan, H.; Lei, Z.; Liu, C.; Zhou, Y. Gmm-Resnext: Combining Generative and Discriminative Models for Speaker Verification. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11706–11710. [Google Scholar]
  34. Cao, D.; Wang, X.; Zhou, J.; Zhang, J.; Lei, Y.; Chen, W. LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-Tdnn for Speaker Verification. arXiv 2024, arXiv:2402.06073. [Google Scholar]
Figure 1. Frame diagram of the speaker verification system.
Figure 2. Network topology of ECAPA-TDNN (C and T correspond to respective channel and time dimensions of intermediate feature map).
Figure 3. Proposed network topology (CAA-TDNN).
Figure 4. CAM structure.
Figure 5. t-SNE reduced dimensional view of different networks. (a) ResNet34; (b) Thin ResNet-34; (c) ECAPA-TDNN; (d) CAA-TDNN.
Figure 6. Effect of different σ values on the system.
Table 1. Network parameters of the model.

#  | Module Name        | Structure    | Out
1  | Conv1D + ReLU + BN | K = 5, P = 1 | (128, 80, 202)
2  | CAM-SE             | K = 3, D = 2 | (128, 1024, 202)
   | CAM-SE             | K = 3, D = 3 | (128, 1024, 202)
   | CAM-SE             | K = 3, D = 4 | (128, 1024, 202)
   | Conv1D + ReLU      | K = 1        | (128, 1536, 202)
3  | ASP + BN           | -            | (128, 3072)
4  | FC + BN            | -            | (128, 192)
Table 2. Effect of different modules on results.

BASE | CAM-SE | AJ-LF | EER (%) | MinDCF
✓    |        |       | 1.16    | 0.0574
✓    | ✓      |       | 0.87    | 0.0468
✓    |        | ✓     | 0.90    | 0.0630
✓    | ✓      | ✓     | 0.84    | 0.0528
Table 3. Comparison of results of the proposed method with those of existing methods on the AISHELL-1 dataset.

Model              | Loss                 | EER (%)
ECAPA-TDNN         | AAMsoftmax           | 1.16
ResNet34           | SphereFace2          | 1.96
Thin ResNet-34     | AMsoftmax            | 1.85
Modified ResNet-50 | AAMsoftmax           | 1.70
Modified ResNet-50 | Softmax + AAMsoftmax | 1.52
CAA-TDNN           | AJ-LF                | 0.84
Table 4. Performance comparison of different loss functions.

Model          | Loss         | EER (%) | MinDCF
ResNet34 [16]  | Softmax      | 1.97    | 0.0956
ResNet34 [16]  | AMsoftmax    | 1.96    | 0.0916
ResNet34 [16]  | AAMsoftmax   | 1.73    | 0.0897
ResNet34 [16]  | SphereFace2  | 1.96    | 0.1133
ResNet34 [16]  | AJ-LF (ours) | 1.96    | 0.0870
ECAPA-TDNN [3] | Softmax      | 1.38    | 0.1160
ECAPA-TDNN [3] | AMsoftmax    | 1.68    | 0.0637
ECAPA-TDNN [3] | AAMsoftmax   | 1.16    | 0.0574
ECAPA-TDNN [3] | SphereFace2  | 1.10    | 0.0660
ECAPA-TDNN [3] | AJ-LF (ours) | 0.90    | 0.0630
CAA-TDNN       | Softmax      | 1.13    | 0.0726
CAA-TDNN       | AMsoftmax    | 1.26    | 0.0690
CAA-TDNN       | AAMsoftmax   | 0.93    | 0.0667
CAA-TDNN       | SphereFace2  | 1.04    | 0.0632
CAA-TDNN       | AJ-LF (ours) | 0.84    | 0.0528
Table 5. EER and MinDCF performance of all systems on the standard VoxCeleb1 test sets.

Model            | Params  | Loss         | Vox1_O (EER % / MinDCF) | Vox1_E (EER % / MinDCF) | Vox1_H (EER % / MinDCF)
ResNet34 [16]    | 23.9 M  | SphereFace2  | 0.88 / 0.0757           | 0.97 / 0.1065           | 1.76 / 0.1699
ECAPA-TDNN [3]   | 14.7 M  | AAMsoftmax   | 0.87 / 0.1066           | 1.12 / 0.1318           | 2.12 / 0.2101
GMM-ResNext [33] | -       | AAMsoftmax   | 0.96 / 0.1168           | 1.20 / 0.1424           | 2.31 / 0.2247
LightCAM [34]    | 8.15 M  | AAMsoftmax   | 0.83 / 0.0891           | 0.95 / 0.1114           | 1.86 / 0.1922
CAA-TDNN         | 14.86 M | AJ-LF (ours) | 0.74 / 0.0827           | 0.92 / 0.1033           | 1.66 / 0.1735
Table 6. Experimental results of the model under different sampling rates.

Model      | 8 kHz (EER % / MinDCF) | 16 kHz (EER % / MinDCF)
ECAPA-TDNN | 16.20 / 0.8605         | 1.16 / 0.0574
CAA-TDNN   | 12.87 / 0.6934         | 0.84 / 0.0528