Article

A Self-Supervised Method for Speaker Recognition in Real Sound Fields with Low SNR and Strong Reverberation

1 School of Instrument and Electronics, North University of China, Taiyuan 030051, China
2 School of Information and Communication Engineering, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 2924; https://doi.org/10.3390/app15062924
Submission received: 7 February 2025 / Revised: 2 March 2025 / Accepted: 4 March 2025 / Published: 7 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Speaker recognition is essential in smart voice applications for personal identification. Current state-of-the-art techniques primarily focus on ideal acoustic conditions. However, the traditional spectrogram struggles to differentiate between noise, reverberation, and speech. To overcome this challenge, the MFCC front-end can be replaced with the output of a self-supervised learning model. This study introduces a TDNN enhanced with a pre-trained model for robust performance in noisy and reverberant environments, referred to as PNR-TDNN. The PNR-TDNN employs HuBERT as its backbone, while the TDNN is an improved ECAPA-TDNN. The pre-trained model employs the Canopy/Mini Batch k-means++ strategy. In the TDNN architecture, several enhancements are implemented, including a cross-channel fusion mechanism based on Res2Net. Additionally, a non-average attention mechanism is applied to the pooling operation, focusing on the weight information of each channel within the Squeeze-and-Excitation Net. Furthermore, the contribution of individual channels to the pooling of time-domain frames is enhanced by substituting attentive statistics with multi-head attentive statistics. Validated on zhvoice under noisy conditions, the minimized PNR-TDNN demonstrates a 5.19% improvement in EER compared to CAM++. In more challenging environments with noise and reverberation, the minimized PNR-TDNN further improves EER by 3.71% and 9.6%, respectively, and MinDCF by 3.14% and 3.77%, respectively. The proposed method has also been validated on the VoxCeleb1 and cn-celeb_v2 datasets, representing a significant step forward in speaker recognition under challenging conditions. This advancement is particularly crucial for enhancing safety and protecting personal identification in voice-enabled microphone applications.

1. Introduction

Speaker recognition encompasses tasks such as verification, identification, and diarization [1,2,3]. The process involves the analysis of voiceprint features extracted from speech, which are shaped by individual variations in vocal tract structure, pronunciation, and accent [4]. Comparable to fingerprint, facial, and iris recognition, voiceprint recognition offers the benefits of contactless operation and high accuracy. It finds extensive application in access control, user authentication, and forensic evidence collection, and has been integrated into various smart microphone systems, including cellphones, smart homes, and smart vehicles [3,5,6,7,8].
Recently, deep neural networks have achieved significant advancements in speaker recognition [9]. Many studies have focused on enhancing feature extraction by using advanced DNN architectures, such as Time-Delay Neural Networks (TDNNs) [10] and Residual Networks (ResNets) [11]. The X-Vector model [10] was introduced as an improvement over the D-Vector model [12]. Unlike the frame-by-frame modeling approach employed in the D-Vector model, the X-Vector model utilizes a pooling layer to capture utterance-level speech embeddings. It also incorporates data augmentation techniques to enhance the performance of TDNN-based speaker recognition systems. In further research on speaker recognition tasks, feature extraction models based on Deep Neural Networks (DNNs) incorporate various architectures [5,13,14,15,16,17]. For instance, ref. [18] utilizes Convolutional Neural Networks (CNNs) to extract local features from speech, while ref. [19] employs Recurrent Neural Networks (RNNs) to capture temporal dependencies and patterns within speech. Notably, ref. [20] advanced the X-Vector approach by incorporating techniques from computer vision, namely the Squeeze-and-Excitation network, Res2Net, and a channel attention propagation and aggregation mechanism. This integration significantly improved speaker recognition performance, marking it as one of the most effective systems of its time.
In environments with an ideal sound field, which are free from ambient noise, sensing noise, interference from other speakers, and echoes, numerous deep neural network approaches for speaker recognition have demonstrated high performance in critical metrics such as EER and MinDCF. However, in actual sound field conditions, sudden occurrences of high-energy noise and significant speech reverberation are typical. For instance, in a confined conference room, sounds such as doors opening and closing, coughing, and footsteps are commonly present. Furthermore, when the sound source is distant, the SNR between the speech and the noise generated by the acoustic sensing equipment is also low. This results in a marked reduction in the effectiveness of speaker recognition techniques, as reported in [21]. Preprocessing of the original speech, which includes the extraction of mel-spectrograms, spectrograms, and MFCCs (mel-frequency cepstral coefficients), is essential prior to input into the back-end model, and the occurrence of noise and reverberation alters the speech's spectral characteristics. To address this problem, researchers have focused on speech preprocessing, feature extraction, loss function design, and the integration of a speech separation frontend [22,23,24,25,26]. For example, ref. [26] introduced the integration of FiLM into a Voice-Filter frontend to enhance the robustness of automatic speech recognition (ASR) and text-independent speaker verification (TI-SV) in noisy and reverberant conditions. These strategies primarily address specific noisy scenarios.
In recent years, self-supervised learning (SSL) has achieved significant success in fields such as computer vision and natural language processing [27,28,29,30,31]. SSL can extract generalized information from unlabeled data without relying on labels, annotations, or structured text [32]. Refs. [32,33,34,35,36,37] successively demonstrated the impressive results obtained when SSL is transferred to ASR tasks. SSL is not only employed for ASR and phoneme recognition but also extends beyond acoustic character recognition to tasks such as speaker recognition [36] and emotion recognition [37].
To overcome the challenges posed by noisy speech during preprocessing, ref. [38] introduced WavLM, which integrates an SSL model with a robust speaker recognition model to create a novel cascaded architecture. However, the model exhibits certain drawbacks: (1) despite incorporating strategies such as mask learning and data augmentations involving speed, volume, and noise, its performance under strong interference consistently falls short of robust baseline models such as ECAPA-TDNN [20] and CAM++ [39]; (2) in [32], the base HuBERT required 32 V100 GPUs to train on 960 h of audio from LibriSpeech. This level of resource usage is highly detrimental to the goals of miniaturization and low power consumption in smart devices.
This study introduces a novel TDNN-based architecture incorporating a pre-trained model for robust speaker recognition in challenging noisy and reverberant conditions (PNR-TDNN). The architecture leverages the strengths of both SSL and TDNN. Firstly, an SSL model inspired by HuBERT is utilized. Given that the number of pseudo-labels is typically a fixed hyperparameter for SSL models, the Canopy and Mini Batch k-means++ algorithm is proposed to more accurately determine the appropriate number of pseudo-labels, which then serve as dynamic adaptive filtering and extraction units for the subsequent speaker recognition stage. Secondly, inspired by methods for extracting neighboring information in visual detection and semantic segmentation tasks [40,41,42,43,44,45,46,47], this work recognizes the importance of processing inter-contextual channel information in speech features. To this end, the ECAPA-TDNN framework is modified with several enhancements, particularly in the TDNN and pooling layers, to more effectively address the challenges of noise and reverberation.
The article’s contributions are as follows:
(1)
The development of a novel training strategy for the SSL model in the presence of strong interference, as well as the introduction of a miniaturized method for integrating the SSL model with the speaker recognition model.
(2)
The pseudo-label generation mechanism of a typical SSL model is improved through the application of the Canopy and Mini Batch K-Means++ algorithms.
(3)
Enhancements are made to the pooling layer and Res2SE module of the ECAPA-TDNN by incorporating multi-head attention-based statistic pooling, cross-channel fusion for residual learning, and a non-average full-channel processing mechanism.
The rest of this paper is structured as follows: Section 2 introduces the typical SSL and speaker recognition model, as well as the proposed PNR-TDNN. Section 3 describes the details from dataset preparation to model training. Section 4 analyzes and compares the results of multiple sets of experiments. The conclusions are summarized in Section 5.

2. Background and Proposed Method

2.1. Problem Description

The speaker verification task is characterized by the use of a labeled dataset, which is partitioned into training and test subsets. The subsets are created by matching speech segments with their corresponding labels. Following the training phase, the utterances from the test dataset are input into the model to obtain prediction vectors. The similarity between these vectors is assessed using an appropriate metric to ascertain the likelihood of their origin from the same speaker. Most traditional speaker recognition systems employ MFCCs as their primary features. As illustrated in Figure 1, however, the presence of strong noise and reverberation significantly interferes with the speech signal, posing challenges to the effectiveness of MFCC-based recognition. The complexity of noise sources makes their elimination challenging, and differentiating reverberation from the clean signal is difficult [48]. Consequently, bypassing the spectrogram and employing an alternative initial feature extractor may be considered as a potential solution. This paper proposes employing a small amount of labeled data in conjunction with a large amount of unlabeled data, which is far more practical in real applications and results in a model with greater generalization and migration capabilities.

2.2. Speech Model of SSL

HuBERT is derived from the integration of BERT [29] and wav2vec [35]. Its architecture is characterized by a multilayer CNN encoder, Transformer [49] encoders, and a projection layer. The parameter details for the base HuBERT are presented in Table 1.
The network architecture is shown in Figure 2. An input speech signal of length L is processed by the CNN encoder, which consists of seven convolutional layers, resulting in T (T = L/320) feature vectors with 512 channels. The T feature vectors are categorized into two types: masked feature vectors x̃, where the masking process randomly selects the starting point of a mask and sets 10 consecutive features to a value of 0, and unmasked feature vectors x. The masked and unmasked features are processed by the Transformer encoders, resulting in the output feature sequences [o_1^m, …, o_{T_1}^m] and [o_1^{um}, …, o_{T_2}^{um}], respectively. The sum of T_1 and T_2 equals the total length T, with t denoting the feature index. The Transformer encoders utilize both global and local information, maintaining strict consistency in data dimensions between input and output.
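For illustration, the span-masking step described above can be sketched as follows in PyTorch; the per-frame start probability used here is a placeholder, and Section 3.2 gives the mask-prob value actually used in this work.

```python
import torch

def mask_features(feats: torch.Tensor, mask_span: int = 10, start_prob: float = 0.08):
    """Zero out randomly chosen spans of `mask_span` consecutive CNN-encoder frames.

    feats: (T, C) frame-level features; returns the masked copy and the boolean mask.
    """
    T = feats.size(0)
    mask = torch.zeros(T, dtype=torch.bool)
    # each frame is a candidate mask start with probability `start_prob`
    starts = torch.nonzero(torch.rand(T) < start_prob).flatten()
    for s in starts:
        mask[s: s + mask_span] = True
    masked = feats.clone()
    masked[mask] = 0.0          # masked vectors are set to 0, as described above
    return masked, mask
```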
The subsequent step involves the calculation of the loss function. Initially, we need to establish the correspondence between the output o_t and the pseudo-labels z, which are associated through a probability density function. In this context, each pseudo-label z (where the total number of classes is C) is assigned to a specific codebook, and pseudo-labels with identical values are mapped to the same codebook. The probability density functions for the variables o_t^m and o_t^{um} with respect to the pseudo-label z are given by Equations (1) and (2), respectively.
p^{m}(z_c \mid \tilde{x}, t) = \frac{\exp\left(\mathrm{sim}(P \cdot o_t^{m}, e_c)/\tau\right)}{\sum_{c'=1}^{C} \exp\left(\mathrm{sim}(P \cdot o_t^{m}, e_{c'})/\tau\right)}    (1)
p^{um}(z_c \mid x, t) = \frac{\exp\left(\mathrm{sim}(P \cdot o_t^{um}, e_c)/\tau\right)}{\sum_{c'=1}^{C} \exp\left(\mathrm{sim}(P \cdot o_t^{um}, e_{c'})/\tau\right)}    (2)
The projection matrix P is back-propagable; e_c denotes the codebook embedding corresponding to label z, i.e., the standard codebook entry for class c. The variable c indicates a specific class. The scale parameter τ is used for temperature adjustment. The function sim(·,·) calculates the cosine similarity between two tensors. p^m represents the probability density function derived from the masked x̃, while p^{um} denotes the probability density function derived from the unmasked x. A cross-entropy loss function is employed. The loss functions L_m and L_{um}, computed over the masked and unmasked features with their corresponding pseudo-labels, are defined by Equation (3) and Equation (4), respectively. The final loss function, as shown in Equation (5), combines L_m and L_{um}, where λ is an adjustment parameter.
L_m = -\sum_{t=1}^{T_1} \log p^{m}(z_c \mid \tilde{x}, t)    (3)
L_{um} = -\sum_{t=1}^{T_2} \log p^{um}(z_c \mid x, t)    (4)
L = \lambda \cdot L_m + (1 - \lambda) \cdot L_{um}    (5)
Notably, HuBERT employs an ensemble of K clustering methods, leading to a multinomial adjustment of the loss function, as demonstrated in Equations (6) and (7).
L_m = -\sum_{k=1}^{K} \sum_{t=1}^{T_1} \log p_k^{m}(z_c \mid \tilde{x}, t)    (6)
L_{um} = -\sum_{k=1}^{K} \sum_{t=1}^{T_2} \log p_k^{um}(z_c \mid x, t)    (7)
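As a minimal sketch of Equations (1)–(5) for a single clustering (K = 1), the masked-prediction loss can be written as follows; the tensor shapes and the default temperature are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(o, z, mask, P, codebook, tau=0.1, lam=0.5):
    """Cross-entropy over cosine-similarity logits, split into masked/unmasked terms.

    o:        (T, D)  Transformer encoder outputs
    z:        (T,)    pseudo-labels in [0, C)
    mask:     (T,)    True for masked frames
    P:        (D', D) projection matrix
    codebook: (C, D') codebook embeddings e_c
    """
    proj = o @ P.t()                                    # project the outputs, (T, D')
    logits = F.cosine_similarity(proj.unsqueeze(1),     # (T, C): similarity to every e_c
                                 codebook.unsqueeze(0), dim=-1) / tau
    ce = F.cross_entropy(logits, z, reduction="none")   # -log p(z | ., t), Eqs. (3)-(4)
    L_m, L_um = ce[mask].sum(), ce[~mask].sum()
    return lam * L_m + (1.0 - lam) * L_um               # Eq. (5)
```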
During the subsequent fine-tuning phase for the ASR task, the projection matrix P is not updated and the parameters of the CNN encoder are frozen; only the parameters of the Transformer encoders are updated, using connectionist temporal classification (CTC) as the loss function. In the HuBERT strategy, MFCC features are initially clustered to generate pseudo-labels, which serve as the target for the first stage of training. Subsequently, the output of the 6th Transformer encoder layer of HuBERT is clustered, yielding pseudo-labels that facilitate the second phase of training.
In contrast to HuBERT, WavLM enhances the training process by adding noise and speech from additional speakers to the clean input speech. It also incorporates a gated relative position bias into the self-attention mechanism of the Transformer architecture. The pre-training strategy is adapted to accept noise-contaminated and overlapping speech as input, while still using clean speech to compute the MFCC-based clustering pseudo-labels.

2.3. Speaker Verification Model

Speech features, including MFCC or Mel-Spectrograms, are extracted through the processing of audio by a characterizer. Subsequently, these extracted features are utilized as input for various neural networks to derive deep embedding features. In the final step, the likelihood of the speaker’s identity is determined through the application of an activation function. Among the commonly employed baseline models in the field are TDNN, ECAPA-TDNN, Res2net, Resnet-SE, Eres2net, and CAM++. The ECAPA-TDNN employs several advanced approaches, including channel-dependent and context-dependent statistics pooling, 1-dimensional Squeeze Excitation Res2Blocks, and a multi-layer feature aggregation and summation mechanism. CAM++ employs the D-TDNN [50] as its backbone architecture, incorporating strategies like context-aware masking.

2.4. Proposed Method Overview

This study proposes a speaker verification method for real-world scenarios: (1) It employs a self-supervised, pseudo-labeling-based approach for speech corrupted by noise and reverberation. In contrast to HuBERT, clean speech is corrupted with a strong additive noise component to produce the corrupted input. Additionally, as in HuBERT, a random masking operation is applied to the input, which constitutes a second phase of interference addition; this is done to mimic the actual acoustic characteristics of a noisy environment. The ultimate objective is to predict the pseudo-labels of the clean speech. The prediction of pseudo-labels is also extended to scenarios involving strong reverberation, targeting reverberant speech. (2) To integrate the pre-trained model with the robust speaker verification architecture, the audio features extracted from the last Transformer encoder are adapted through a 1-dimensional convolution layer to a format compatible with the verification network. (3) ECAPA-TDNN is enhanced through the integration of a convolutional aggregation mechanism inspired by U-Net, which facilitates the fusion of contextual features within the TDNN sub-network. The integration of attentive average pooling into the Squeeze-Excitation module enables the TDNN subnet to adaptively and discriminatively learn channel-wise weight information. This modification enhances the network's capability to extract speech features under strong noise interference across varying time intervals. Lastly, we incorporate a multi-head attentive statistics mechanism into the pooling factor α, enhancing the network's resilience to interference from low- and medium-frequency noise after pooling.

2.4.1. Network Architecture

The proposed model architecture comprises two components. The first component, as illustrated in Figure 3, integrates a CNN encoder with Transformer encoders, incorporating a projection matrix that is used exclusively during the pre-training phase. The Transformer encoders consist of 12 layers, each performing self-attention operations, with an output channel dimension of 768, 12 attention heads, and a projection dimension of 256 for the loss function. The second component, shown in Figure 4, depicts the details of the improved ECAPA-TDNN; Res2net-U, SE-attentive mean, and multi-head attentive statistics pooling are shown in Figure 4b, Figure 4c, and Figure 4d, respectively. The PNR-TDNN and ECAPA-TDNN share the capability to process speech segments of varying lengths. Signals with a duration exceeding 0.5 s are used for training and testing; when the signal length exceeds 3 s, it is truncated to 3 s.
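Point (2) of Section 2.4 connects the frozen pre-trained encoder to the verification network through a 1-dimensional convolution. A minimal sketch of such an adapter is given below; the 768-to-512 channel mapping follows the dimensions quoted in the text, while the kernel size is an assumption.

```python
import torch.nn as nn

class SSLFeatureAdapter(nn.Module):
    """Maps last-encoder features (B, T, 768) to the TDNN input layout (B, 512, T)."""
    def __init__(self, in_dim: int = 768, out_channels: int = 512, kernel_size: int = 1):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, out_channels, kernel_size)

    def forward(self, ssl_feats):            # ssl_feats: (B, T, in_dim)
        x = ssl_feats.transpose(1, 2)        # -> (B, in_dim, T) for Conv1d
        return self.proj(x)                  # -> (B, out_channels, T)
```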

2.4.2. Canopy/Mini Batch K-Means++

In the application of k-means [51] clustering to HuBERT/WavLM, determining the number of clusters and identifying the initial cluster centers can be challenging, particularly for large-scale datasets. Figure 5 and Algorithm 1 illustrate the approach adopted in this study, which employs Canopy [52] and Mini Batch k-means++ [53] to address these issues. The approach is as follows: (1) Set initial distance thresholds T_1 and T_2 (T_1 > T_2), compare the Euclidean distances between the data points and the existing canopy centers against these thresholds, determine whether new canopies are formed, and thereby find the appropriate number of clusters K. (2) Assuming that n initial cluster centers have been selected (0 < n < K), points that are farther away from the current centers have a higher probability of being selected as the (n+1)-th cluster center, until the selection of all K cluster centroids is completed. (3) A portion of the dataset is randomly selected and allocated to clusters based on proximity to their centroids; the centroids are iteratively updated until stability is achieved or the maximum number of iterations is reached.
Algorithm 1: Canopy/Mini Batch k-means++ algorithm
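A rough sketch of the three steps above is given below, using scikit-learn's MiniBatchKMeans (which already provides k-means++ seeding and mini-batch centroid updates); the default thresholds follow the T_1 = 400 and T_2 = 180 setting reported in Section 4.2.2, and the batch size is illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def canopy_num_clusters(X, t1, t2, seed=0):
    """Step (1): coarse Canopy pass over X (N, D) to estimate the cluster count K.
    Points within T2 of a canopy centre are removed from the candidate list
    (T1 > T2; points within T1 would be assigned to the canopy)."""
    rng = np.random.default_rng(seed)
    centres, remaining = [], np.arange(len(X))
    while len(remaining):
        idx = remaining[rng.integers(len(remaining))]
        centres.append(X[idx])
        dists = np.linalg.norm(X[remaining] - X[idx], axis=1)
        remaining = remaining[dists > t2]        # drop tightly covered points
    return len(centres)

def cluster_pseudo_labels(X, t1=400.0, t2=180.0, batch_size=10000):
    # Steps (2)-(3): k-means++ seeding, then mini-batch centroid updates on random subsets.
    k = canopy_num_clusters(X, t1, t2)
    km = MiniBatchKMeans(n_clusters=k, init="k-means++",
                         batch_size=batch_size, max_iter=100, random_state=0)
    return km.fit_predict(X), k
```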

2.4.3. Fusion Res2net and SE-Attentive Mean

In Figure 6a, the input is divided into n feature groups by Res2net. The i-th feature is fused with the (i+1)-th feature through a 3 × 3 convolutional residual connection, except for the first feature, which is preserved unchanged. Inspired by multi-rate tracking in navigation systems [54], this study utilizes the multi-scale fusion capability of the U-net architecture. Figure 4b illustrates the convolutional fusion process within the U-net architecture. In the downsampling phase, features are combined by applying a 3 × 3 convolution, where the i-th-level feature is integrated with the (i+1)-th-level feature. Conversely, during the upsampling phase, the i-th-level feature is concatenated with the (i−1)-th-level feature through a similar 3 × 3 convolutional process. Furthermore, at the deepest level (the n-th feature), a convolution is applied and the result is directly propagated to the subsequent layer. This mechanism facilitates the integration of contextual information and the fusion of neighboring channels.
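One possible reading of this U-Net-style cross-channel fusion is sketched below in PyTorch; it is an interpretive sketch rather than the authors' exact layer (for brevity, the up-path fusion uses addition where the text describes concatenation, and the group convolutions are 1-D with kernel size 3).

```python
import torch
import torch.nn as nn

class Res2NetUFusion(nn.Module):
    """Split the channels into `scale` groups, fuse each group with its neighbour on a
    "down" pass (i into i+1) and again on an "up" pass (i+1 into i), then re-concatenate."""
    def __init__(self, channels: int = 512, scale: int = 8, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0
        w, pad = channels // scale, kernel_size // 2
        self.scale = scale
        self.down = nn.ModuleList([nn.Conv1d(w, w, kernel_size, padding=pad)
                                   for _ in range(scale - 1)])
        self.up = nn.ModuleList([nn.Conv1d(w, w, kernel_size, padding=pad)
                                 for _ in range(scale - 1)])

    def forward(self, x):                          # x: (B, C, T)
        xs = torch.chunk(x, self.scale, dim=1)     # n feature groups
        down = [xs[0]]                             # first group kept unchanged
        for i in range(1, self.scale):
            down.append(self.down[i - 1](xs[i] + down[i - 1]))
        up = [down[-1]]                            # deepest level propagated directly
        for i in range(self.scale - 2, -1, -1):
            up.insert(0, self.up[i](down[i] + up[0]))
        return torch.cat(up, dim=1)                # back to (B, C, T)
```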
Equations (8)–(10) illustrate the three steps executed by the SE (Squeeze-Excitation) module within the ECAPA-TDNN framework: (1) x(B, F, T) is averaged over the T dimension to obtain x_mean(B, F, 1), as shown in Figure 4c, “Mean Parameter”; (2) x_mean is squeezed and excited to obtain weight coefficients for all F channels, x_f(B, F, 1); the linear operation corresponding to W_1 and b_1 represents the squeeze, whereas W_2 and b_2 represent the excitation, as shown in Figure 4c, “Fully Connected”; (3) x_f is extended to x_ft (along the T dimension), and subsequently x_ft is used to re-weight x to obtain x̃(B, F, T), as shown in Figure 6c, “Re-weight”.
x_{mean} = \frac{1}{T} \sum_{t=1}^{T} x_t    (8)
x_f = \mathrm{sigmoid}\left(W_2 \cdot \mathrm{relu}(W_1 \cdot x_{mean} + b_1) + b_2\right)    (9)
\tilde{x} = x_{ft} \cdot x    (10)
This study introduces an attentive channel mean mechanism to replace the traditional mean operation, acknowledging the varying contributions of different frames to the speaker's message, particularly in the differentiation between vocalized and silent states. Concretely, x(B, F, T) is sequentially processed through compression, recovery, and probability estimation to yield the attentive channel mean score α.
In Equation (11), x undergoes squeeze and recovery operations along the F channel. The linear transformations corresponding to the weight matrix W_3 and bias vector b_3 denote the squeeze operation, while W_4 and b_4 denote the recovery operation, as illustrated in Figure 4(c1), “Fully Connected”. The hyperbolic tangent, tanh, is used as the activation function. These operations capture the linear relationships among all F channels. Subsequently, the Softmax activation function is applied along the T dimension, yielding the fractional distribution over different time frames. Equation (12) shows that α is element-wise multiplied with x, followed by aggregation of the resulting values along the T dimension. Finally, the optimized mean feature x̃_mean is derived, as shown in Figure 4(c1), “Hadamard Product+Sum”.
\alpha = \mathrm{Softmax}\left(W_4 \cdot \tanh(W_3 \cdot x + b_3) + b_4\right)    (11)
\tilde{x}_{mean} = \sum_{t=1}^{T} \alpha_t \cdot x_t    (12)
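A minimal PyTorch sketch of an SE block with the attentive channel mean of Equations (11) and (12) replacing the plain temporal average is given below; the bottleneck width is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SEAttentiveMean(nn.Module):
    """Squeeze-Excitation whose squeeze step uses an attentive mean over time."""
    def __init__(self, channels: int = 512, bottleneck: int = 128):
        super().__init__()
        # attention scorer: W3/b3 (squeeze) and W4/b4 (recovery), Eq. (11)
        self.att = nn.Sequential(nn.Conv1d(channels, bottleneck, 1), nn.Tanh(),
                                 nn.Conv1d(bottleneck, channels, 1))
        # SE weights: W1/b1 (squeeze) and W2/b2 (excitation), Eq. (9)
        self.fc1 = nn.Linear(channels, bottleneck)
        self.fc2 = nn.Linear(bottleneck, channels)

    def forward(self, x):                                          # x: (B, F, T)
        alpha = torch.softmax(self.att(x), dim=2)                  # Eq. (11): scores over T
        x_mean = (alpha * x).sum(dim=2)                            # Eq. (12): (B, F)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(x_mean))))  # Eq. (9)
        return x * s.unsqueeze(2)                                  # Eq. (10): re-weight x
```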

2.4.4. Multi-Head Attentive Statistics Pooling

Analysis of MFCC features extracted from the original speech reveals that noise is predominantly distributed in the low-frequency and mid-frequency bands, while speech is predominantly found in the low-frequency band. Upon substituting the MFCC extractor with a pre-trained model, the noise in the newly extracted features is also characterized by a full-channel distribution. Attentive statistics pooling in ECAPA-TDNN is designed to capture global channel information, which, however, may lead to heightened noise interference in the pooling outcomes. To address this issue, this work introduces a modification involving the replacement of the attentive statistics pooling with a multi-head attention-based pooling mechanism [55]. This approach divides the entire channel space into several autonomous sub-channel spaces, aiming to reduce mutual interference.
Specifically, the mean and standard deviation of the output features from the final TDNN layer are computed initially. As shown in Figure 4d, “Catenate”, the resulting feature, after aggregating the above three, is defined as o(d, T). Here, d can be understood as the frequency band and T as the time. o is separated into k non-overlapping subspaces x(d/k, T) along the channel dimension d, as shown in Figure 4(d1), “Chunk”. Among these subspaces, x_t for a T channel dimension in x is defined as [x_t^1, x_t^2, …, x_t^k] (1 ≤ t ≤ T, 1 ≤ i ≤ k, x_t^i ∈ ℝ^{d/k}). Similarly, x^i for a d channel dimension in x is defined as [x_1^i, x_2^i, …, x_T^i]. Next, unlike the channel-independent self-attention scoring mechanism in [20], a channel-isolated self-attention scoring mechanism is used here. The corresponding scoring factor e for each sub-channel space (in the d channel) is calculated using Equation (13).
e = W_6 \cdot \tanh(W_5 \cdot x + b_5) + b_6    (13)
W_5 ∈ ℝ^{F×(d/k)} and W_6 ∈ ℝ^{F×F} are linear unit operations, and b_5 and b_6 are their corresponding bias vectors, as shown in Figure 4(d1), “Fully Connected”. Then, the Softmax function (Equation (14)) is utilized to compute the importance percentage scores across the various T channels. For each dimension of the d channel, the score is denoted as β^i, which represents the importance distribution of data frames across different T channels within each d channel dimension.
\beta_j^{i} = \frac{\exp(e_j^{i})}{\sum_{j'=1}^{T} \exp(e_{j'}^{i})}    (14)
β is concatenated along the channel dimension d from the β^i components. Subsequently, γ is constructed by concatenating β across the k subspaces, as depicted in Figure 4(d1), “Catenate”. γ_t, representing the T channel dimension within γ, is defined as the concatenation [γ_t^1, γ_t^2, …, γ_t^k], where each γ_t^i is a component of the respective subspace. Owing to the specificity of the different d channel dimensions, the mean μ̃ and standard deviation σ̃ obtained through multi-head attentive statistics pooling are formulated in Equations (15) and (16), as also shown in Figure 4d, “Hadamard Product”, “mean”, and “std”. The pooling output is the concatenation of μ̃ and σ̃.
\tilde{\mu} = \sum_{t=1}^{T} \gamma_t \cdot x_t    (15)
\tilde{\sigma} = \sqrt{\sum_{t=1}^{T} \gamma_t \cdot x_t^{2} - \tilde{\mu}^{2}}    (16)
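A PyTorch sketch of the multi-head attentive statistics pooling of Equations (13)–(16) follows; the per-head scorer below uses the conventional bottleneck form (sub-space width to hidden to sub-space width), so the hidden size and the head count are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttentiveStatsPool(nn.Module):
    """Split channels into k sub-spaces, score each over time independently (Eqs. (13)-(14)),
    then pool the weighted mean and standard deviation (Eqs. (15)-(16))."""
    def __init__(self, channels: int = 1536, heads: int = 8, hidden: int = 128, eps: float = 1e-6):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.eps = heads, eps
        d_sub = channels // heads
        self.scorers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(d_sub, hidden, 1), nn.Tanh(),   # W5, b5, tanh
                          nn.Conv1d(hidden, d_sub, 1))              # W6, b6
            for _ in range(heads)])

    def forward(self, o):                                   # o: (B, d, T)
        chunks = torch.chunk(o, self.heads, dim=1)          # k non-overlapping sub-spaces
        gammas = [torch.softmax(s(c), dim=2)                # Eq. (14): softmax over T
                  for s, c in zip(self.scorers, chunks)]
        gamma = torch.cat(gammas, dim=1)                    # concatenate back along d
        mu = (gamma * o).sum(dim=2)                         # Eq. (15)
        var = (gamma * o.pow(2)).sum(dim=2) - mu.pow(2)     # Eq. (16), before the sqrt
        sigma = var.clamp(min=self.eps).sqrt()
        return torch.cat([mu, sigma], dim=1)                # (B, 2d) pooled statistics
```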

3. Experimental Setup

3.1. Dataset Generation

Voice data for this study were obtained from the open-source Chinese speech corpus, zhvoice, which comprises eight datasets. The corpus, following noise reduction and silence removal, includes more than 3200 speakers and approximately 900 h of audio, all recorded at a 16 kHz sampling rate. Two versions of the zhvoice corpus are adapted for this paper, tailored to conference room conditions. One version simulates scenarios with both noise and reverberation (herein referred to as NR-zhvoice), while the other version represents conditions with noise only (referred to as N-zhvoice). For NR-zhvoice, the simulation of reverberant speech is a preliminary step. The room impulse response is simulated using the wham_room library [56], where the configuration of parameters for reverb is shown in Table 2. In addition, wham_room utilizes a dual-source and dual-microphone configuration by default, in contrast to the minimalist speaker verification approach, which operates with monaural audio. Consequently, identical sets of sound sources and microphone parameters are input into wham_room, with only the left channel of the first microphone being employed for reverb output.
In the second processing step, the reverberant speech is corrupted by adding a randomly chosen noise at a randomly determined SNR within the range of [0, 0.5] dB. In cases where the noise duration is shorter than that of the speech, the noise is looped by concatenating it from its beginning to match the speech length. The noises are derived from two primary collections. The first is the soundsnap website, which encompasses a diverse array of sounds, including air conditioners, fans, cell phone vibration, coughing, sneezing, walking, flipping through documents, keyboards, and more; this collection includes 1950 records across twelve categories. Additionally, as shown in Figure 7 and Figure 8, actual recordings (a total of 480 recordings sampled at 16 kHz, all cropped to 3 s) were made in a working conference room using a Behringer U-PHORIA UMC202HD sound sensor. N-zhvoice is generated by applying only the second step of the NR-zhvoice pipeline, i.e., adding noise to zhvoice without reverberation.
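The two-step corruption can be sketched as follows. Since the exact wham_room interface is not reproduced here, the reverberation step uses pyroomacoustics directly; the room geometry and RT60 are illustrative placeholders, while the SNR range and noise-looping rule follow the text.

```python
import numpy as np
import pyroomacoustics as pra

def simulate_reverb(speech, fs=16000, rt60=0.5, room_dim=(6.0, 4.0, 3.0)):
    """Convolve speech with a simulated room impulse response (one source, one mic)."""
    absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(absorption), max_order=max_order)
    room.add_source([2.0, 2.0, 1.5], signal=speech)
    room.add_microphone([4.0, 2.5, 1.2])
    room.simulate()
    return room.mic_array.signals[0]

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the given SNR, looping the noise if it is shorter."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    scale = np.sqrt(np.mean(speech ** 2) /
                    (np.mean(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    return speech + scale * noise

# NR-zhvoice-style sample: reverberate, then add noise at an SNR drawn from [0, 0.5] dB
# noisy = add_noise(simulate_reverb(clean), noise_clip, np.random.uniform(0.0, 0.5))
```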

3.2. Hyperparameter Settings

The speech from N-zhvoice and NR-zhvoice is used as input for the pre-trained model, while the k-means clustering of MFCCs from the zhvoice/reverberant zhvoice dataset serves as the iterative target. A sub-dataset of N-zhvoice/NR-zhvoice is used as the training data for the speaker verification model, with the speaker classification labels designated as the learning objective. The pre-trained model is trained for 15 epochs, after which the speaker verification network is trained for an additional 60 epochs. The proposed method, implemented in Python 3.8 using PyTorch 2.0.0, was executed on an Ubuntu 20.04 system equipped with two RTX 4090 GPUs and one Intel(R) Xeon(R) Gold 5218R CPU (Intel, Santa Clara, CA, USA).
(1)
Training details and settings of the pre-trained model: for MFCC extraction, the frame length is 25 ms, the frame shift is 10 ms, the sampling frequency is 16 kHz, and 39 MFCC coefficients are used. The CNN encoder uses a downsampling rate of 1/320. The values of mask-prob and mask-channel-prob are both set to 0.9, giving an approximately 1:1 ratio of masked to non-masked features. The mask length is set to 10, the learning rate to 5 × 10−4, and the weight decay to 1 × 10−2; the optimizer is Adam with beta parameters randomly sampled from the range (0.9, 1.8). The dropout rate is fixed at 0.1.
(2)
Training details and network settings of the TDNN: the model is trained with a learning rate of 1 × 10−3 and a weight decay of 1 × 10−6, using Adam as the optimization algorithm, with 512 feature channels. The batch size is set to 12, and the maximum duration of audio clips is limited to 3 s. For the loss function, the Additive Angular Margin (AAM) softmax [57] is employed with a margin of 0.2 and a scale factor of 30 (a sketch of this loss is given below).
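A compact sketch of the AAM-softmax loss referenced above (margin 0.2, scale 30); the weight initialization and the numerical clamp are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive Angular Margin softmax over speaker classes."""
    def __init__(self, embed_dim: int, num_speakers: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, embeddings, labels):
        # cosine similarity between embeddings and class centres
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only on the target class, then rescale by s
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```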

3.3. Evaluation Metrics

The pre-trained model utilizes two datasets, N-zhvoice and NR-zhvoice, each comprising 1,090,000 unlabeled samples. Approximately 5% of these data are randomly selected and labeled for use in subsequent speaker verification tasks. Of the labeled data, 1∼1.5% is designated as the test set, with the remainder allocated for training purposes. The evaluation criteria for the speaker verification tasks encompass the Equal Error Rate (EER) and the Minimum Detection Cost Function (MinDCF). Associated metrics include the False Acceptance Rate (FAR), the False Rejection Rate (FRR), and the threshold settings. The EER is defined as the point at which the FAR and FRR are equal, reflecting a balance between the two types of errors at different threshold settings. The formula for computing MinDCF is provided in Equation (17).
DCF = C_{FRR} \times FRR \times P_T + C_{FAR} \times FAR \times P_I    (17)
The costs associated with false rejection (C_{FRR}) and false acceptance (C_{FAR}) are both set to 1, while the prior probabilities of the real speaker (P_T) and the impostor speaker (P_I) are assigned as 0.01 and 0.99, respectively.
The metrics are computed through the following steps: (1) acoustic features, extracted either by an MFCC extractor or by the pre-trained model, are input into the speaker recognition model to derive classification features; (2) the cosine similarity between the classification features of each trial pair is calculated and designated as the target score; (3) the target labels and the corresponding target scores are compiled into a score list; (4) the threshold is incremented from 0 to 0.99 in steps of 0.01, and for each threshold the FRR and FAR are determined. The EER and MinDCF values are then computed from these rates.
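A sketch of this threshold sweep is given below; the step size, cost weights, and priors follow Equation (17) and the surrounding text, while the input format (paired scores and binary labels) is an assumption.

```python
import numpy as np

def eer_mindcf(scores, labels, c_frr=1.0, c_far=1.0, p_target=0.01, p_impostor=0.99):
    """Sweep thresholds from 0 to 0.99 in 0.01 steps, returning the EER and minimum DCF.

    scores: cosine similarities of trial pairs; labels: 1 = same speaker, 0 = impostor."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    frrs, fars = [], []
    for th in np.arange(0.0, 1.0, 0.01):
        accept = scores >= th
        frrs.append(np.mean(~accept[labels == 1]))   # genuine trials rejected
        fars.append(np.mean(accept[labels == 0]))    # impostor trials accepted
    frrs, fars = np.array(frrs), np.array(fars)
    i = np.argmin(np.abs(frrs - fars))
    eer = (frrs[i] + fars[i]) / 2.0
    dcf = c_frr * frrs * p_target + c_far * fars * p_impostor   # Eq. (17)
    return eer, dcf.min()
```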

4. Results and Comparison

This section discusses the convergence of the pre-trained model, the performance of speaker verification models utilizing MFCC-based input features, and the performance of PNR-TDNN with a self-supervised learning-based feature extractor.

4.1. Pre-Trained Model Experiments

As shown in Figure 9, 100 clusters were obtained when T_2 was set to 165. In general, an increase in T_2 is associated with a decrease in the number of clusters, whereas the impact of T_1 on cluster formation is minimal. Figure 10 illustrates the convergence of the pre-trained model for varying numbers of MFCC label clusters: (1) Convergence of the pre-trained model becomes particularly challenging as the number of clusters is increased; it is evident that WavLM achieves convergence more readily than HuBERT. (2) Both HuBERT and WavLM face increased difficulty in fitting the data as the number of clusters grows, although, from the perspective of model refinement, a larger number of clusters generally corresponds to improved back-end modeling capability. (3) It is crucial to select appropriate values of T_2 and T_1 to accurately determine the number of clusters. In the subsequent experiments, the optimal number of clusters was found to range between 60 and 90 when T_2 is set to approximately 180. This finding serves as a useful guideline for selecting the number of clusters in similar scenarios.

4.2. PNR-TDNN Experiments

Unlike models that utilize MFCCs for speaker verification, the PNR-TDNN employs the output features of a pre-trained network for the speaker verification task. These features, with their higher dimensionality, provide increased flexibility and enable adaptability across diverse application scenarios, but they also result in slower convergence of the recognition network, so the number of training epochs is set to 60.

4.2.1. Discussion of Baselines

To facilitate a fair comparison, the experiments in this study utilize a subset of the data described in Section 3.1. These data are further annotated and include instances of noise interference as well as combined noise and reverberation interference. Specifically, the sub-datasets drawn from N-zhvoice/NR-zhvoice for speaker verification are presented in Table 3. The baselines comprise TDNN, Resnet-SE, Res2net, Eres2net, ECAPA-TDNN (512 channels), and CAM++ (32 channels), along with other state-of-the-art methods from the last two years shown in Table 4 (MFA-Conformer, Wespeaker, RedimNet, and Gemini). All baseline models fit satisfactorily within 30 epochs, which is why training is limited to this number of epochs.
For dataset A, as presented in Table 4, CAM++ demonstrates superior performance compared to the other methods with respect to both EER and MinDCF, and RedimNet achieves the second-lowest EER. For datasets B and C, CAM++ achieves a notable lead in EER, whereas ECAPA-TDNN and Resnet-SE yield the best MinDCF results. Additionally, the results after adding reverberation interference are worse than those with noise alone, indicating the detrimental effect of reverberation on the speaker recognition task; this study argues that the trailing phenomenon in the frequency domain impairs the network's capability to process information effectively. The computational complexity of the models is measured in terms of floating-point operations (FLOPs) and is assessed using a CPU (Intel(R) Xeon(R) Gold 5218R CPU @ 2.1 GHz; the batch size for these evaluations is fixed at 12). The real-time factor (RTF) is a metric used to gauge the efficiency of the model's processing time, defined as the ratio of the time taken by the model to process a single audio clip to the duration of the audio itself.

4.2.2. Discussion of Clusters and Compressed Models

Table 5 presents the experimental results of various PNR-TDNNs. We found that an increased number of MFCC clusters corresponding to the pseudo-labels within the pre-trained model leads to enhanced effectiveness of the back-end recognition model. When the number of clusters exceeds 100, HuBERT experiences gradient overflow within the first three epochs, and the corresponding PNR-TDNN exhibits significantly weaker recognition performance. Therefore, the optimal strategy is to select the maximum number of clusters that still allows the pre-trained model to converge, balancing the model's capacity with its ability to learn effectively. In addition, recognition performance peaks when the number of clusters is set to 80. Based on the methodology described in Section 2.4.2, which involves Canopy and Mini Batch k-means++, it is determined that for this application scenario the optimal number of clusters is obtained when T_1 and T_2 are set to 400 and 180, respectively.
In the case of noise-only interference, the EER difference between Hu + EC(80) and Wa + EC(80) is negligible, whereas the difference in MinDCF is notable. Hu + EC(80) exhibits superior performance in both metrics under combined noise and reverberation. Our study suggests that the ambiguity introduced by full-channel reverberation makes the gated relative position bias inappropriate for the PNR-TDNN. Additionally, the results for Hu + CA(80) are consistently lower than those of Hu + EC(80) across all evaluated metrics and computational factors. Consequently, Hu + EC(80) is designated as the reference implementation for PNR-TDNNs in subsequent experiments.
Another key issue is that Hu + EC occupies 400 M, which is far too large to be deployed on smart microphone devices. As shown in Table 6, we reduced the parameters of HuBERT's Transformer encoder and feed-forward network. The minimized HuBERT + EC (T-HuBERT + EC) is only 90 M, which is of the same order of magnitude as ECAPA-TDNN and CAM++. In subsequent experiments, small HuBERT + EC and tiny HuBERT + EC outperform or nearly equal base HuBERT + EC in both EER and MinDCF, indicating that base HuBERT is not fully utilized during pre-training.
The pseudo-labels for the pre-trained model carry a certain degree of uncertainty, and their generation and application are far less precise than real labels. This is also an advantage, as the model can better learn fuzzy information and adapt to various downstream tasks. However, the main disadvantage of fuzzy information is that it does not support fine-grained learning: although base HuBERT contains 12 layers of Transformer encoders, the training target is too vague, and the multi-layered, repeatedly cascaded network structure cannot extract additional useful information. It is therefore feasible to reduce the number of encoder layers and the feature dimension appropriately within the narrow SNR range considered here.
Finally, the computational complexity analysis of the proposed method is given in Table 5; a processing speed of 75 speech features per second can be achieved on a single Intel(R) Xeon(R) Gold 5218R CPU.

4.2.3. Discussion of PNR-TDNN Ablation Experiments

Table 7 presents the performance of the proposed method, which is discussed in detail below. The experiments are numbered for ease of reference. Experiments 1–3 modify the residual module of Hu + EC(80), with the corresponding residual networks depicted in Figure 6a, Figure 6c, and Figure 6b, respectively. Experiment 4 introduces the SE-attentive mean mechanism. Experiment 5 employs the multi-head attentive statistics pooling mechanism. In experiment 6, we combine the SE-attentive mean with the multi-head attentive statistics pooling mechanism and integrate these enhancements into Hu + EC(80). Experiments 7–9 are the enhanced versions obtained after applying the fusion Res2net, SE-attentive mean, and multi-head attentive statistics pooling mechanisms.
In comparison to the joint training approach where the pre-trained model and back-end model are trained together, as described in references [32,38], our findings indicate that the entire model tends to converge more effectively when the pre-trained model is entirely frozen during the training process.
For experiments under noise-only interference conditions: (1) Employing pseudo-labels based on MFCC clusters of clean speech significantly enhances the performance of the PNR-TDNN architecture, surpassing the established baselines. (2) Figure 11a reveals that the EER of each PNR-TDNN declines markedly beyond 30 epochs, while the MinDCF continues to improve; this indicates that the optima of EER and MinDCF are not reached concurrently.
For experiments conducted under noise and reverberation interference conditions, our findings were as follows: (1) The PNR-TDNN consistently outperforms the baselines only when MFCC clusters with reverberation are utilized as pseudo-labels. (2) When the extended version of dataset A, comprising 66,000 utterances with noise and reverberation interference, is utilized as the training set, an improvement in the EER of the test set is challenging to achieve. However, when dataset A is reduced to create dataset B (51,000 utterances) and dataset C (30,000 utterances), a notable reduction in EER is observed. The study suggests that the limitation of pre-trained models in capturing concrete information, which is less specific than MFCC features, is magnified by the presence of an excessive number of labels during PNR-TDNN training. (3) Figure 11b,c illustrate that the EERs for datasets B and C remain consistent. The slower convergence of the loss function indicates that the pre-trained model exhibits robust resistance to over-fitting. The resilience enables the PNR-TDNN to progressively enhance its efficacy in the latter phase of the training process, a trend that shares similarities with the performance characteristics observed in CAM++.
In comparison to the best model (CAM++) discussed in Section 4.2.1, experiment 8 yields an EER improvement of 9.54% while almost maintaining a similar MinDCF on dataset A. On dataset B, an 11.13% EER improvement is observed, accompanied by a slight improvement in the MinDCF. For experiment 7 on dataset C, a 23.77% EER improvement and a 1.88% enhancement in the MinDCF are achieved. These results suggest that the enhancement mechanisms proposed in this paper demonstrate superior performance in speaker verification tasks under conditions of strong noise and reverberation.
In addition, the EER of experiment 8 is slightly better than experiment 9, indicating that larger pre-trained models still have a few advantages in backend applications, but the improvement effect is not significant. For example, for dataset A, experiment 8’s EER improves by 4.5% compared to that of experiment 9. Considering the miniaturization requirements of smart microphone devices, choosing smaller pre-trained model is still the preferred choice.
Finally, we performed significance tests on the model. Specifically, five cross-training sessions were conducted using dataset A, and a t-test was then applied to the mean and variance of the resulting EERs. As shown in Figure 12, the five models are statistically significantly different, which demonstrates that the proposed model provides a substantial improvement over the other four baseline models and that the improvements are not due to random variation. To better present the results, as shown in Figure 13, we present the best-EER comparison plots of the four baseline models and the proposed T-Hu + EC + FAM on dataset A, dataset B, and dataset C.

4.2.4. Discussion of Other Noisy Situations

To assess the generalization capability of the PNR-TDNN, ESC50 (primarily environmental sounds) and a subset of cn-celeb_v2 (mainly Chinese speech) are selected to generate noise interference for the zhvoice. The ESC50 encompasses animal sounds, natural phenomena, water noises, human non-verbal cues, as well as indoor and domestic sounds. These closely resemble the noise sources used in Section 4.2.1 (soundsnap and actual conference room recordings), enhancing the realism of the testing conditions. Furthermore, the impact of human speech interference is significant and cannot be overlooked. To simulate this, a random selection of speech segments from the cn-celeb_v2 dataset is used as a second noise source. The speech in zhvoice and cn-celeb_v2 is sourced from different individuals, reflecting realistic conditions in which multiple speakers may be present.
Table 8 reveals that PNR-TDNNs outperformed all baselines in terms of recognition accuracy when exposed to ESC50 noise. Despite the presence of vocal interference, the S-Hu + EC + FAM yields robust recognition outcomes. In addition, S-Hu + EC + FAM substantially outperforms Hu + EC + FAM on all three datasets, which corroborates the conclusion in Section 4.2.2 that the small pre-trained model is more effective at capturing speech feature information in a strongly noisy environment with a narrow range of SNR. This demonstrates the strong generalization capability of the proposed method in diverse scenarios, thereby showcasing the model’s adaptability to different conditions.

4.2.5. Discussion of Other SNR’s Situations

The SNR used in the previous sections to characterize the noise level ranges from 0 to 0.5 dB, which is a very narrow range. This means that, through repeated training, the network may simply learn to avoid the influence of noise (in Section 2.1, we observed that noise energy is more easily dispersed than speech energy). Therefore, we supplemented the comparative experiments with a larger SNR range. In the production of noisy speech, the SNR is drawn uniformly at random from either a narrow (0 to 0.5 dB) or a wide (−5 to 3 dB) range. More than 60% of these utterances fall under negative SNR conditions, and fewer than 40% under low positive SNR. These conditions are closer to the complex and challenging real sound field; the experimental results are listed in Table 9.
On dataset A, the larger Hu + EC + FAM has the best EER, with an improvement of 12.6% with respect to CAM++, while both S-Hu + EC + FAM and T-Hu + EC + FAM also perform better than CAM++. This suggests that the larger the pre-trained model, the better the adaptation to the back-end task. For datasets B and C, the EER of Hu + EC + FAM improves by 5.2%/6.12% relative to RedimNet, while the performance of CAM++ is not stable enough to serve as the reference baseline. Furthermore, the EERs of S-Hu + EC + FAM and T-Hu + EC + FAM are both slightly weaker than that of RedimNet.
Under narrow-range SNR, the fusion Res2net and SE-attentive mean modules weaken the noise, so the noise effect is largely removed by this implicit noise suppression. The smaller Hu + EC + FAM variants can therefore easily outperform the baselines in milder acoustic environments, but the large pre-trained model still has a significant advantage when the noise intensity exceeds the speech intensity and the SNR range is very wide.
When exposed to stronger noise energy and a wide range of SNR, the large pre-trained model is still able to extract more feature information about a particular speaker through its deeper network parameters; this finding is significantly different from the experimental results obtained for the narrow-range SNR.

4.2.6. Discussion of Other Public Datasets

The PNR-TDNNs are also evaluated on the well-known VoxCeleb1 [62] and cn-celeb_v2 [63] datasets. Notably, VoxCeleb1 and cn-celeb_v2 contain a large amount of noise. To match the initial conditions of zhvoice, a total of 300 noise-free and reverberation-free speakers are carefully selected from VoxCeleb1, amounting to 100,482 utterances for pre-training (denoted as VoxCeleb1-S). Similarly, 122 noise-free and reverberation-free speakers are chosen from cn-celeb_v2, resulting in 54,272 utterances for pre-training (denoted as cn-celeb_v2-S).
To assess the performance of the PNR-TDNNs, two subsets of VoxCeleb1-S are randomly selected and labeled. As described in Table 10, these two subsets are named VoxCeleb1-A and VoxCeleb1-B, respectively. The same processing steps are also applied to cn-celeb_v2-S. Table 11 shows the evaluation results on VoxCeleb1-A/B and cn-celeb_v2-A/B. As shown in Figure 14, we also present the best-EER comparison plots of the four baseline models and the proposed T-Hu + EC + FAM on VoxCeleb1-B and cn-celeb_v2-B. The EER of CAM++ on VoxCeleb1 and cn-celeb_v2 is erratic, which may be attributed to the more complex acoustic environments encountered in these datasets. Several newer approaches, including MFA-Conformer, yield suboptimal EER compared to ECAPA-TDNN.
In evaluation experiments on VoxCeleb1 with only strong noise interference, the small PNR-TDNNs achieve the best EER; for example, T-Hu + EC + FAM achieves a 32% improvement in EER relative to ECAPA-TDNN. In the strong-reverberation-and-noise situation, reverberation seriously interferes with speaker identification accuracy, although the PNR-TDNNs still have the lowest EER among the numerous baselines. These results indicate the effectiveness of the proposed method on the English dataset. As for cn-celeb_v2 s2, the proposed method continues to yield the best recognition performance, albeit with a relatively smaller improvement; this can be attributed to the inherently stronger noise present in the original cn-celeb_v2, which degrades the feature extraction capability of the back-end network.

5. Conclusions

This paper proposes PNR-TDNN, a speaker verification method designed to handle challenging conditions such as strong noise and reverberation interference. The study improves upon the current model by applying a new training strategy to the SSL model and by introducing multi-head attentive statistics pooling, cross-channel fusion of residuals, and a non-average full-channel mechanism into the embedding extraction units, namely the pooling layer and the SE-Res2 block of ECAPA-TDNN, respectively. A miniaturization method for the PNR-TDNN is also proposed and tested. The proposed method has been evaluated on several datasets characterized by strong noise and reverberation interference, and the experimental results demonstrate varying levels of reduction in EER, confirming the effectiveness of the PNR-TDNN approach. Further research is anticipated to extend the application of PNR-TDNN to the domain of speaker identification.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z. and C.S.; software, X.Z.; validation, X.Z. and C.S.; formal analysis, X.Z. and C.S.; investigation, X.Z. and C.S.; resources, J.T. and J.L.; data curation, X.Z. and C.S.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., C.S., H.C. and C.W.; visualization, X.Z.; supervision, J.T. and J.L.; project administration, C.S.; funding acquisition, J.L., J.T., C.S., C.W. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by Innovative Research Group Project of National Natural Science Foundation of China (Nos. 51821003), the Excellent Youth foundation of Shanxi Province (Nos. 202103021222011), the Key research and development project of Shanxi Province (Nos. 202202020101002), the fundamental research program of Shanxi Province (Nos. 202303021211150), the Aviation Science Foundation (Nos. 2022Z0220U0002), the Shanxi province key laboratory of quantum sensing and precision measurement (Nos. 201905D121001), 173 Foundation (Nos. 2021XXXX0668), ZBZDJCYJKT (Nos. 51405XX01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data can be obtained via the following website: zhvoice URL: https://github.com/fighting41love/zhvoice; VoxCeleb1 URL: https://mm.kaist.ac.kr/datasets/voxceleb; cn-celeb_v2 URL: http://openslr.org/82/; soundsnap URL: https://www.soundsnap.com/ (accessed on 6 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Q.; Lin, X.D. Efficient and Privacy-Preserving Speaker Recognition for Cybertwin-Driven 6G. IEEE Internet Things J. 2021, 8, 16195–16206. [Google Scholar] [CrossRef]
  2. Lee, S.W.; Lee, D.H. From Attack to Identification: MEMS Sensor Fingerprinting Using Acoustic Signals. IEEE Internet Things J. 2023, 10, 5447–5460. [Google Scholar] [CrossRef]
  3. Zhang, L.; Tan, S.; Chen, Y.; Yang, J. A Continuous Articulatory-Gesture-Based Liveness Detection for Voice Authentication on Smart Devices. IEEE Internet Things J. 2022, 9, 23320–23331. [Google Scholar] [CrossRef]
  4. Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40. [Google Scholar] [CrossRef]
  5. Li, X.; Ze, J. Enrollment-Stage Backdoor Attacks on Speaker Recognition Systems via Adversarial Ultrasound. IEEE Internet Things J. 2024, 11, 13108–13124. [Google Scholar] [CrossRef]
  6. Dong, Y.; Yao, Y.D. Secure mmWave-Radar-Based Speaker Verification for IoT Smart Home. IEEE Internet Things J. 2021, 8, 3500–3511. [Google Scholar] [CrossRef]
  7. Huang, W.; Tang, W.; Jiang, H.; Luo, J.; Zhang, Y. Stop Deceiving! An Effective Defense Scheme Against Voice Impersonation Attacks on Smart Devices. IEEE Internet Things J. 2022, 9, 5304–5314. [Google Scholar] [CrossRef]
  8. Bian, J.; Al Arafat, A. Machine Learning in Real-Time Internet of Things (IoT) Systems: A Survey. IEEE Internet Things J. 2022, 9, 8364–8386. [Google Scholar] [CrossRef]
  9. Hanifa, R.M.; Isa, K.; Mohamad, S. A review on speaker recognition: Technology and challenges. Comput. Electr. Eng. 2021, 90, 107005. [Google Scholar] [CrossRef]
  10. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Computer Vision and Pattern Recognition (CVPR) 2016, IEEE Computer Society, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Dominguez, J.G. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, Florence, Italy, 4–9 May 2014; pp. 4052–4056. [Google Scholar]
  13. Snyder, D.; Garcia-Romero, D.; Sell, G.; McCree, A.; Povey, D.; Khudanpur, S. Speaker recognition for multi-speaker conversations using x-vectors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, Brighton, UK, 12–17 May 2019; pp. 5796–5800. [Google Scholar]
  14. Garcia-Romero, D.; McCree, A.; Snyder, D.; Sell, G. JHU-HLTCOE system for the VoxSRC speaker recognition challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, Virtual, 4–9 May 2020; pp. 7559–7563. [Google Scholar]
  15. Shi, Y.; Huang, Q.; Hain, T. H-VECTORS: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model. Neural Netw. 2021, 142, 329–339. [Google Scholar] [CrossRef]
  16. Zhou, T.; Zhao, Y.; Wu, J. Resnext and res2net structures for speaker verification. In Proceedings of the IEEE Spoken Language Technology Workshop 2021, Shenzhen, China, 19–22 January 2021; pp. 301–307. [Google Scholar]
  17. Liu, B.; Chen, Z.; Wang, S.; Wang, H.; Han, B.; Qian, Y. Dfresnet: Boosting speaker verification performance with depth-first design. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; Ko, H., Hansen, J.H.L., Eds.; ISCA: Singapore, 2022; pp. 296–300. [Google Scholar]
  18. Gao, Z.; Song, Y.; McLoughlin, I.V.; Guo, W.; Dai, L. An improved deep embedding learning method for short duration speaker verification. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3578–3582. [Google Scholar]
  19. Chen, C.P.; Zhang, S.Y.; Yeh, C.T.; Wang, J.C.; Wang, T.; Huang, C.L. Speaker characterization using TDNN-LSTM based speaker embedding. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6211–6215. [Google Scholar]
  20. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN-based speaker verification. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
  21. Togneri, R.; Pullella, D. An overview of speaker identification: Accuracy and robustness issues. IEEE Circuits Syst. Mag. 2011, 11, 23–61. [Google Scholar] [CrossRef]
  22. Chen, Z. On the Detection of Adaptive Adversarial Attacks in Speaker Verification Systems. IEEE Internet Things J. 2023, 10, 16271–16283. [Google Scholar] [CrossRef]
  23. Li, X.; Zheng, Z.; Yan, C.; Li, C.; Ji, X.; Xu, W. Toward Pitch-Insensitive Speaker Verification via Soundfield. IEEE Internet Things J. 2024, 11, 1175–1189. [Google Scholar] [CrossRef]
  24. Kheder, W.B.; Matrouf, D. Additive noise compensation in the i-vector space for speaker recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing ICASSP, Brisbane, Australia, 19–24 April 2015; pp. 4190–4194. [Google Scholar]
  25. Juneja, K. Two level Noise Robust and Block Featured PNN Model for Speaker Recognition in Real Environment. Wirel. Pers. Commun. 2022, 125, 3741–3771. [Google Scholar] [CrossRef]
  26. O’Malley, T.; Ding, S.; Narayanan, A.; Wang, Q.; Rikhye, K.; Liang, Q.; He, Y.; McGraw, I. Conditional Conformer: Improving Speaker Modulation For Single And Multi-User Speech Enhancement. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  27. Liu, X.; Zhang, T.; Liu, M. Joint estimation of pose, depth, and optical flow with a competition–cooperation transformer network. Neural Netw. 2024, 171, 263–275. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, X.; Zhang, T.; Liu, M. UDF-GAN: Unsupervised dense optical-flow estimation using cycle Generative Adversarial Networks. Knowl.-Based Syst. 2023, 271, 110568. [Google Scholar] [CrossRef]
  29. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  30. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS); Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
  31. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. JMLR 2020, 21, 1–67. [Google Scholar]
  32. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. arXiv 2021, arXiv:2106.07447. [Google Scholar] [CrossRef]
  33. Chung, Y.-A.; Hsu, W.-N.; Tang, H.; Glass, J. An unsupervised autoregressive model for speech representation learning. arXiv 2019, arXiv:1904.03240. [Google Scholar]
  34. Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
  35. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  36. Chen, Z.; Chen, S. Large-scale self-supervised speech representation learning for automatic speaker verification. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022. [Google Scholar]
  37. Yan, X.C.; Lin, Z.H. A Novel Exploitative and Explorative GWO-SVM Algorithm for Smart Emotion Recognition. IEEE Internet Things J. 2023, 10, 9999–10011. [Google Scholar] [CrossRef]
  38. Chen, S.; Wang, C. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. arXiv 2022, arXiv:2110.13900v5. [Google Scholar] [CrossRef]
  39. Wang, H.; Zheng, S.; Chen, Y.; Cheng, L.; Chen, Q. CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking. arXiv 2023, arXiv:2303.00332. [Google Scholar]
  40. Slade, S.; Zhang, L.; Huang, H.; Asadi, H.; Lim, C.P.; Yu, Y.; Zhao, D.; Lin, H.; Gao, R. Neural Inference Search for Multiloss Segmentation Models. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15113–15127. [Google Scholar] [CrossRef]
  41. Zhang, L.; Slade, S.; Lim, C.P.; Asadi, H.; Nahavandi, S.; Huang, H.; Ruan, H. Semantic segmentation using Firefly Algorithm-based evolving ensemble deep neural networks. Knowl.-Based Syst. 2023, 277, 110828. [Google Scholar] [CrossRef]
  42. Wang, A.; Xu, Y.; Wei, X.; Cui, B.B. Semantic Segmentation of Crop and Weed using an Encoder-Decoder Network and Image Enhancement Method under Uncontrolled Outdoor Illumination. IEEE Access 2020, 8, 81724–81734. [Google Scholar] [CrossRef]
  43. Wang, A.; Pen, T.; Cao, H.; Xu, Y.; Wei, X.; Cui, B.B. TIA-YOLOv5: An improved YOLOv5 network for real-time detection of crop and weed in the field. Front. Plant Sci. 2022, 13, 1091655. [Google Scholar] [CrossRef]
  44. Shen, C.; Wu, Y.; Qian, G.; Wu, X.; Cao, H.; Wang, C.; Tang, J.; Liu, J. Intelligent Bionic Polarization Orientation Method Using Biological Neuron Model for Harsh Conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 789–806. [Google Scholar] [CrossRef]
  45. Wu, X.; Liu, J.; Shen, C.; Cao, H.; Wang, C.; Tang, J. Vehicle-Mounted Polarization Orientation Strategy Considering Inclined Pavement and Occlusion Conditions. IEEE Trans. Intell. Veh. 2024. early access. [Google Scholar] [CrossRef]
  46. Shen, C.; Zhao, X.; Wu, X.; Cao, H.; Wang, C.; Tang, J.; Liu, J. Multiaperture Visual Velocity Measurement Method Based on Biomimetic Compound-Eye for UAVs. IEEE Internet Things J. 2024, 11, 11165–11174. [Google Scholar] [CrossRef]
  47. Liu, X.; Tang, J.; Shen, C.; Wang, C.; Zhao, D.; Guo, X.; Li, J.; Liu, J. Brain-like position measurement method based on improved optical flow algorithm. ISA Trans. 2023, 143, 221–230. [Google Scholar] [CrossRef] [PubMed]
  48. Zhang, X.; Tang, J.; Cao, H.; Wang, C.; Shen, C.; Liu, J. Cascaded Speech Separation Denoising and Dereverberation Using Attention and TCN-WPE Networks for Speech Devices. IEEE Internet Things J. 2024, 11, 18047–18058. [Google Scholar] [CrossRef]
  49. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  50. Yu, Y.-Q.; Li, W.-J. Densely connected time delay neural network for speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 25–29 October 2020; pp. 921–925. [Google Scholar]
  51. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  52. Zhang, G.; Zhang, C.; Zhang, H. Improved K-means Algorithm Based on Density Canopy. Knowl.-Based Syst. 2018, 145, 289–297. [Google Scholar] [CrossRef]
  53. Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, LA, USA, 7–9 January 2007. [Google Scholar]
  54. Shen, C.; Xiong, Y.; Zhao, D.; Wang, C.; Cao, H.; Song, X.; Tang, J.; Liu, J. Multi-rate strong tracking square-root cubature Kalman filter for MEMS-INS/GPS/polarization compass integrated navigation system. Mech. Syst. Signal Process. 2022, 163, 108146. [Google Scholar] [CrossRef]
  55. India, M.; Safari, P.; Hernando, J. Self Multi-Head Attention for Speaker Recognition. arXiv 2019, arXiv:1906.09890. [Google Scholar]
  56. Scheibler, R.; Bezzam, E.; Dokmanic, I. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 351–355. [Google Scholar]
  57. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  58. Zhang, Y.; Lv, Z.; Wu, H.; Zhang, S.; Hu, P.; Wu, Z.; Lee, H.; Meng, H. Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification. arXiv 2022, arXiv:2203.15249. [Google Scholar]
  59. Wang, H.; Liang, C.; Wang, S.; Chen, Z.; Zhang, B.; Xiang, X.; Deng, Y.; Qian, Y. Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit. In Proceedings of the ICASSP 2023 IEEE, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  60. Yakovlev, I.; Makarov, R.; Balykin, A.; Malov, P.; Okhotnikov, A.; Torgashov, N. Reshape Dimensions Network for Speaker Recognition. arXiv 2024, arXiv:2407.18223. [Google Scholar]
  61. Liu, T.; Lee, K.A.; Wang, Q.; Li, H. Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2324–2337. [Google Scholar] [CrossRef]
  62. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 2616–2620. [Google Scholar]
  63. Fan, Y.; Kang, J.W.; Li, L.T.; Li, K.C.; Chen, H.L.; Cheng, S.T.; Zhang, P.Y.; Zhou, Z.Y.; Cai, Y.Q.; Wang, D. CN-Celeb: A Challenging Chinese Speaker Recognition Dataset. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing IEEE, Barcelona, Spain, 4–8 May 2020. [Google Scholar]
Figure 1. Various spectrograms for actual sound field situations: (a) clean spectrum; (b) with weak noise; (c) with strong noise; (d) with reverberation; (e) with weak noise and reverberation; (f) with strong noise and reverberation; (g) noise spectrum at low frequency; (h) speaker-like noise spectrum; (i) noise spectrum across frequencies.
Figure 2. The architecture of HuBERT and WavLM.
Figure 3. The PNR pipeline. Contaminated speech is encoded sequentially by the CNN and HuBERT encoders, and the resulting representation is mapped to the output by a projection layer.
Figure 4. The PNR-TDNN pipeline and details of the proposed TDNN architecture: (a) the conversion of speech to loss; (b) the structure of the fusion Res2Net within the TDNN module; (c,c1) the dimensionality of the features in the SE-attentive mean module; (d) the dimension transformation of the features in the multi-head attentive statistics pooling module; (d1) the computation of the pooling factor α.
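The multi-head attentive statistics pooling referenced in Figure 4(d,d1) can be pictured with a short sketch. The snippet below is a minimal PyTorch illustration of the general idea (per-head attention weights over frames, followed by a weighted mean and standard deviation); the class name, head count, and bottleneck size are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-head attentive statistics pooling: each head attends
# to an equal channel split and produces frame weights, weighted means, and
# weighted standard deviations. Illustration only, not the authors' code.
import torch
import torch.nn as nn

class MultiHeadAttentiveStatsPool(nn.Module):
    def __init__(self, channels: int, heads: int = 4, bottleneck: int = 128):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.split = heads, channels // heads
        # One small attention network per head, applied to that head's channel split.
        self.attn = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(self.split, bottleneck, kernel_size=1),
                nn.Tanh(),
                nn.Conv1d(bottleneck, self.split, kernel_size=1),
            )
            for _ in range(heads)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        outputs = []
        for h, attn in enumerate(self.attn):
            xh = x[:, h * self.split:(h + 1) * self.split, :]
            alpha = torch.softmax(attn(xh), dim=2)          # pooling weights over frames
            mu = torch.sum(alpha * xh, dim=2)               # weighted mean
            var = torch.sum(alpha * xh ** 2, dim=2) - mu ** 2
            sigma = torch.sqrt(var.clamp(min=1e-8))         # weighted standard deviation
            outputs.append(torch.cat([mu, sigma], dim=1))
        return torch.cat(outputs, dim=1)                    # (batch, 2 * channels)

# Example: 192 channels, 100 frames, 4 heads -> pooled vector of size 384.
pool = MultiHeadAttentiveStatsPool(192, heads=4)
print(pool(torch.randn(2, 192, 100)).shape)  # torch.Size([2, 384])
```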
Figure 5. The algorithm of Canopy/Mini Batch k-means++.
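For a concrete picture of the clustering strategy named in Figure 5, the sketch below combines a Canopy pass (loose/tight thresholds T1 > T2, cf. Figure 9) to estimate the number of clusters with scikit-learn's MiniBatchKMeans using k-means++ initialization. The canopy() helper, the thresholds, and the feature dimensions are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch of the Canopy / Mini Batch k-means++ idea: a cheap Canopy pass
# estimates the cluster count, which then seeds MiniBatchKMeans (k-means++ init).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def canopy(points, t1, t2):
    """Canopy pre-clustering: T1 is the loose threshold, T2 the tight one (T1 > T2)."""
    remaining = list(range(len(points)))
    canopies = []  # list of (center, member indices) pairs
    rng = np.random.default_rng(0)
    while remaining:
        center = points[remaining[rng.integers(len(remaining))]]
        dists = np.linalg.norm(points[remaining] - center, axis=1)
        members = [i for i, d in zip(remaining, dists) if d <= t1]   # within loose T1
        canopies.append((center, members))
        remaining = [i for i, d in zip(remaining, dists) if d > t2]  # outside tight T2
    return canopies

# Illustrative frame-level features (e.g., MFCC-like vectors); sizes are assumptions.
features = np.random.default_rng(1).standard_normal((5000, 39)).astype(np.float32)
k = len(canopy(features, t1=12.0, t2=10.0))      # cluster count estimated by Canopy
labels = MiniBatchKMeans(n_clusters=k, init="k-means++", batch_size=1024,
                         n_init=3, random_state=0).fit_predict(features)
print(k, labels.shape)
```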
Figure 6. Comparison of different Res2Net variants.
Figure 7. Conference room for noise acquisition.
Figure 8. Noise acquisition equipment.
Figure 9. Number of clusters corresponding to T1 and T2.
Figure 10. Convergence of HuBERT/WavLM.
Figure 11. Metrics of different PNR-TDNNs in dataset A/B/C: (a) EER in dataset A; (b) EER in dataset B; (c) EER in dataset C; (d) MinDCF in dataset A; (e) MinDCF in dataset B; (f) MinDCF in dataset C.
Figure 12. Histogram of significant differences between the four baseline models and the T-Hu+EC+FAM model on dataset A (a, b, c, d, and e denote the comparison groups corresponding to the proposed T-Hu+EC+FAM, Gemini, RedimNet, CAM++, and ECAPA-TDNN, respectively).
Figure 13. Best EERs of the four baseline models and the T-Hu + EC + FAM model (blue for dataset A, red for dataset B, and gray for dataset C).
Figure 14. Best EERs of the four baseline models and the T-Hu+EC+FAM model (blue for cnceleb_v2-B, red for VoxCeleb1-B).
Table 1. Parameters of base HuBERT.
CNN Encoder (seven convolutional layers): strides [5, 2, 2, 2, 2, 2, 2]; kernels [10, 3, 3, 3, 3, 2, 2]
Transformer Encoders: layers [12]; attention heads [12]
Projection Layer: dimension [256]
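The convolutional configuration in Table 1 fixes the frame rate of the pre-trained front end. A quick check, assuming 16 kHz input audio as is standard for HuBERT, shows that the encoder downsamples by a factor of 320, i.e., one output frame every 20 ms:

```python
# Frame rate implied by the CNN encoder in Table 1 (assuming 16 kHz input audio).
from math import prod

strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

hop = prod(strides)                     # total downsampling factor
print(hop)                              # 320 samples per output frame
print(1000 * hop / 16000, "ms")         # 20.0 ms frame shift

# Number of output frames for a waveform of n samples (no padding):
def out_frames(n: int) -> int:
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

print(out_frames(16000))                # 49 frames for 1 s of audio
```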
Table 2. Reverb configuration (u: uniform distribution).
Room/m: L (length) u(5, 10); W (width) u(5, 10); H (height) u(3, 4)
Source/m: L = L/2 + u(−0.2, 0.2); W = W/2 + u(−0.2, 0.2); H = u(0.9, 1.8)
T60: low u(0.1, 0.5); middle u(0.5, 1); high u(1, 1.5)
Microphone/m: L = L/2 + u(−1.6, −0.8) or u(0.8, 1.6); W = W/2 + u(−1.6, −0.8) or u(0.8, 1.6); H = u(5, 10)
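Reference [56] (pyroomacoustics) can realize configurations of this kind. The sketch below samples one room in the spirit of Table 2; the sampling and placement details (for example, the microphone height and the choice of the middle T60 band) are illustrative assumptions, not the authors' simulation script.

```python
# Sketch: sample one reverberant configuration following the ranges in Table 2,
# using pyroomacoustics [56]. Illustration only, not the authors' setup.
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)
u = rng.uniform

L, W, H = u(5, 10), u(5, 10), u(3, 4)            # room size (m), as in Table 2
rt60 = u(0.5, 1.0)                               # "middle" T60 band from Table 2
absorption, max_order = pra.inverse_sabine(rt60, [L, W, H])

room = pra.ShoeBox([L, W, H], fs=16000,
                   materials=pra.Material(absorption), max_order=max_order)

speech = rng.standard_normal(16000)              # placeholder 1 s source signal
room.add_source([L / 2 + u(-0.2, 0.2), W / 2 + u(-0.2, 0.2), u(0.9, 1.8)],
                signal=speech)
room.add_microphone([L / 2 + u(0.8, 1.6), W / 2 + u(0.8, 1.6), 1.5])  # mic height assumed
room.simulate()
reverberant = room.mic_array.signals[0]          # reverberant signal at the microphone
```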
Table 3. Datasets for speaker verification (percentage: the proportion of selected data to the original unlabeled speech; noise added at an SNR of 0–1; noise sources: Soundsnap and actual recordings).
Dataset | Percentage (%) | Speakers Number | Training Set Amount | Test Set Amount
dataset A | 6.05 | 877 | 66,000 | 855
dataset B | 4.67 | 696 | 51,000 | 642
dataset C | 2.75 | 422 | 30,000 | 359
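The noisy utterances are described as mixtures at SNRs around 0–1 dB. A common way to build such mixtures, shown here only as a sketch and not as the authors' pipeline, is to scale the noise so that the speech-to-noise power ratio matches the target SNR:

```python
# Sketch: mix noise into clean speech at a target SNR (dB). Illustration only.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so that 10*log10(P_speech / P_scaled_noise) equals snr_db."""
    noise = np.resize(noise, speech.shape)            # loop or trim the noise to match
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # placeholder 1 s clean signal
noise = rng.standard_normal(8000)                     # placeholder noise clip
noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(0.0, 1.0))
```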
Table 4. Baseline models' best EER and MinDCF.
Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF | Params, Model (M) | Flops (G) | RTF (1 × 10−3)
TDNN | 7.52 | 0.201 | 9.47 | 0.192 | 7.35 | 0.155 | 11.3 | 0.318 | 0.47
Resnet-SE | 8.40 | 0.184 | 8.72 | 0.188 | 9.73 | 0.159 | 56.1 | 3.719 | 11.56
Res2net | 10.26 | 0.204 | 10.77 | 0.195 | 9.78 | 0.169 | 111.6 | 1.002 | 2.97
Eres2net | 7.61 | 0.196 | 7.89 | 0.189 | 6.65 | 0.160 | 27.4 | 2.428 | 7.18
ECAPA-TDNN [20] | 7.18 | 0.188 | 8.44 | 0.190 | 6.77 | 0.154 | 33.9 | 0.973 | 1.23
CAM++ [39] | 5.97 | 0.178 | 7.81 | 0.191 | 6.35 | 0.159 | 28.7 | 0.813 | 4.63
MFA-Conformer [58] | 7.98 | 0.198 | 8.78 | 0.189 | 7.40 | 0.160 | 77.0 | 0.994 | 2.48
Wespeaker [59] | 10.88 | 0.206 | 12.12 | 0.197 | 10.22 | 0.174 | 0.96 | 0.006 | 0.19
RedimNet [60] | 6.59 | 0.197 | 8.19 | 0.192 | 7.45 | 0.166 | 21.1 | 1.290 | 8.79
Gemini [61] | 7.78 | 0.201 | 8.43 | 0.190 | 8.12 | 0.163 | 26.7 | 3.834 | 26.7
Table 5. PNR-TDNN's best EER and MinDCF with different cluster numbers (Note: Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, EC = ECAPA-TDNN, Wa = WavLM, CA = CAM++).
Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF | Params, Model (M) | Flops (G) | RTF (1 × 10−3)
Hu + EC(60) | 6.41 | 0.189 | 8.51 | 0.196 | 7.56 | 0.161 | 405.1 | 16.11 | 23.50
Hu + EC(70) | 6.39 | 0.190 | 8.10 | 0.195 | 6.84 | 0.161 | 405.1 | 16.11 | 23.50
Hu + EC(80) | 6.35 | 0.186 | 7.84 | 0.194 | 6.01 | 0.158 | 405.1 | 16.11 | 23.50
Hu + EC(90) | 6.58 | 0.195 | 8.37 | 0.197 | 7.35 | 0.164 | 405.1 | 16.11 | 23.50
Hu + EC(100) | 12.69 | 0.208 | 9.87 | 0.201 | 8.81 | 0.172 | 405.1 | 16.11 | 23.50
Hu + EC(110) | 25.78 | 0.832 | 29.31 | 0.978 | 24.69 | 0.728 | 405.1 | 16.11 | 23.50
Hu + EC(120) | 32.66 | 1.235 | 37.57 | 1.424 | 33.71 | 1.354 | 405.1 | 16.11 | 23.50
Wa + EC(80) | 6.53 | 0.193 | 9.21 | 0.194 | 7.56 | 0.162 | 405.1 | 16.11 | 25.20
Hu + CA(80) | 6.33 | 0.192 | 7.99 | 0.199 | 7.16 | 0.166 | 404.6 | 16.24 | 26.80
S-Hu + EC(80) | 5.78 | 0.196 | 7.34 | 0.191 | 5.94 | 0.172 | 140.6 | 9.55 | 15.67
T-Hu + EC(80) | 5.70 | 0.186 | 8.05 | 0.185 | 6.06 | 0.156 | 90.2 | 8.46 | 13.32
Table 6. Parameters of compressed HuBERT.
Model | Transformer Encoders | Feed-Forward Net
base HuBERT | layers: 12, heads: 12 | dim: 3072
small HuBERT | layers: 4, heads: 12 | dim: 1536
tiny HuBERT | layers: 2, heads: 12 | dim: 512
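Table 6 specifies only the depth, head count, and feed-forward width of the compressed encoders. The stand-in below instantiates a transformer stack with the "tiny HuBERT" settings to illustrate the shapes involved; the model width (d_model = 768, as in base HuBERT) is an assumption, and the snippet does not reproduce the authors' compression procedure.

```python
# Minimal stand-in for the "tiny HuBERT" transformer stack in Table 6:
# 2 layers, 12 attention heads, feed-forward dim 512; d_model = 768 is assumed.
import torch
import torch.nn as nn

d_model = 768  # assumed encoder width (base-HuBERT value)
tiny_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                               dim_feedforward=512, batch_first=True),
    num_layers=2,
)

frames = torch.randn(4, 49, d_model)   # (batch, frames, features), e.g., 1 s of audio
print(tiny_encoder(frames).shape)      # torch.Size([4, 49, 768])
```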
Table 7. PNR-TDNN's best EER and MinDCF with multiple enhancement methods (Note: Hu = HuBERT, EC = ECAPA-TDNN, S-Hu = small HuBERT, T-Hu = tiny HuBERT, F = fusion Res2Net, A = SE-attentive mean, M = multi-head attentive statistics pooling).
MethodologyDataset ADataset BDataset C
EER (%)MinDCFEER (%)MinDCFEER (%)MinDCF
1Hu + EC (original Res2net)6.380.1888.370.1936.170.157
2Hu + EC (reverse Res2net)6.510.1888.110.1905.970.163
3Hu + EC + F5.820.1818.090.1925.930.161
4Hu + EC + A6.140.1807.760.1915.380.165
5Hu + EC + M5.820.1837.670.1915.710.158
6Hu + EC + AM5.880.1827.580.1924.870.155
7Hu + EC + FAM5.650.1787.470.1904.840.156
8S-Hu + EC + FAM5.400.1806.940.1905.650.161
9T-Hu + EC + FAM5.660.1827.520.1855.740.153
Table 8. PNR-TDNN's best EER and MinDCF with ESC50 and CnCeleb noise (Note: EC = ECAPA-TDNN, Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, FAM = fusion Res2Net + SE-attentive mean + multi-head attentive statistics pooling).
Noise sources: ESC50/CnCeleb (each cell lists the ESC50 value / the CnCeleb value).
Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF
EC | 9.42/8.89 | 0.200/0.203 | 9.41/9.45 | 0.193/0.195 | 7.31/7.08 | 0.170/0.170
CAM++ | 8.21/7.68 | 0.197/0.198 | 8.83/10.43 | 0.193/0.197 | 8.58/8.30 | 0.174/0.177
MFA-Conformer | 8.95/10.52 | 0.204/0.203 | 9.91/9.81 | 0.194/0.192 | 6.96/7.03 | 0.171/0.167
Wespeaker | 13.3/13.3 | 0.209/0.207 | 13.2/13.6 | 0.197/0.197 | 11.0/11.8 | 0.176/0.174
RedimNet | 9.11/8.39 | 0.209/0.208 | 8.67/8.89 | 0.194/0.195 | 10.63/9.15 | 0.176/0.171
Gemini | 8.76/9.14 | 0.207/0.205 | 9.37/9.70 | 0.194/0.194 | 8.69/9.36 | 0.175/0.175
Hu + EC | 7.61/7.43 | 0.199/0.195 | 8.35/8.95 | 0.191/0.194 | 7.11/6.84 | 0.160/0.160
Hu + EC + FAM | 7.41/7.27 | 0.196/0.192 | 8.12/8.71 | 0.190/0.190 | 6.80/6.63 | 0.162/0.156
S-Hu + EC + FAM | 5.87/5.79 | 0.191/0.188 | 7.27/6.98 | 0.187/0.185 | 5.25/4.19 | 0.155/0.146
T-Hu + EC + FAM | 6.26/6.44 | 0.200/0.190 | 8.35/7.63 | 0.188/0.185 | 6.51/5.39 | 0.168/0.145
Table 9. PNR-TDNN's best EER and MinDCF over a wide SNR range (−5∼3 dB) (Note: EC = ECAPA-TDNN, Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, FAM = fusion Res2Net + SE-attentive mean + multi-head attentive statistics pooling).
Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF
EC | 8.72 | 0.201 | 9.59 | 0.188 | 6.44 | 0.157
CAM++ | 7.43 | 0.196 | 8.62 | 0.192 | 17.22 | 0.334
MFA-Conformer | 9.25 | 0.203 | 9.32 | 0.188 | 6.99 | 0.161
Wespeaker | 12.43 | 0.208 | 12.04 | 0.195 | 10.26 | 0.174
RedimNet | 7.97 | 0.206 | 8.39 | 0.187 | 6.53 | 0.160
Gemini | 7.77 | 0.202 | 8.51 | 0.191 | 8.86 | 0.166
Hu + EC | 6.65 | 0.197 | 8.57 | 0.191 | 6.88 | 0.159
Hu + EC + FAM | 6.49 | 0.196 | 7.95 | 0.185 | 6.13 | 0.154
S-Hu + EC + FAM | 6.73 | 0.195 | 8.48 | 0.189 | 6.36 | 0.162
T-Hu + EC + FAM | 7.20 | 0.202 | 8.62 | 0.191 | 7.02 | 0.163
Table 10. VoxCeleb1-A/B and cn-celeb_v2-A/B setup (percentage: the proportion to the original unlabeled speech (VoxCeleb1-S or cn-celeb_v2-S)).
Dataset | Percentage (%) | Training Set Amount | Test Set Amount
VoxCeleb1-A | 8 | 8098 | 142
VoxCeleb1-B | 40.2 | 40,398 | 478
cn-celeb_v2-A | 14.5 | 8035 | 155
cn-celeb_v2-B | 49 | 26,618 | 570
Table 11. PNR-TDNN's best EER and MinDCF on other public datasets (Note: EC = ECAPA-TDNN, S-Hu = small HuBERT, T-Hu = tiny HuBERT, Hu = HuBERT, FAM = fusion Res2Net + SE-attentive mean + multi-head attentive statistics pooling).
Each cell lists the value for subset A / subset B.
Methodology | VoxCeleb1-A/B (Noise) EER (%) | MinDCF | VoxCeleb1-A/B (Noise and Reverb) EER (%) | MinDCF | cnceleb_v2-A/B (Noise) EER (%) | MinDCF | cnceleb_v2-A/B (Noise and Reverb) EER (%) | MinDCF
EC | 6.87/13.5 | 0.250/0.331 | 18.0/23.3 | 0.308/0.509 | 12.6/8.68 | 0.201/0.334 | 14.1/14.0 | 0.241/0.438
CAM++ | 15.9/15.5 | 0.306/0.409 | 34.7/59.7 | 0.403/0.930 | 13.8/7.52 | 0.235/0.310 | 16.2/13.6 | 0.241/0.426
MFA-Conformer | 17.1/15.5 | 0.329/0.421 | 19.3/24.2 | 0.356/0.573 | 13.4/11.9 | 0.241/0.504 | 15.8/18.7 | 0.255/0.560
Wespeaker | 13.9/25.7 | 0.319/0.582 | 22.2/34.7 | 0.351/0.624 | 16.6/20.3 | 0.246/0.768 | 18.0/27.3 | 0.270/0.818
RedimNet | 13.9/15.1 | 0.306/0.412 | 16.4/25.6 | 0.335/0.580 | 14.8/10.3 | 0.236/0.369 | 17.2/16.5 | 0.241/0.579
Gemini | 17.0/18.9 | 0.351/0.468 | 20.5/22.4 | 0.329/0.545 | 16.7/12.1 | 0.255/0.493 | 19.9/16.9 | 0.266/0.620
Hu + EC | 4.62/9.96 | 0.247/0.212 | 17.1/21.7 | 0.304/0.536 | 12.2/7.15 | 0.203/0.309 | 14.2/13.2 | 0.244/0.445
Hu + EC + FAM | 3.30/9.27 | 0.208/0.205 | 15.8/20.9 | 0.296/0.523 | 11.9/6.73 | 0.197/0.281 | 13.8/12.8 | 0.240/0.414
S-Hu + EC + FAM | 3.06/9.78 | 0.195/0.275 | 15.0/19.2 | 0.307/0.505 | 10.9/4.87 | 0.240/0.258 | 12.9/11.9 | 0.233/0.408
T-Hu + EC + FAM | 4.39/9.19 | 0.219/0.244 | 16.5/20.5 | 0.313/0.542 | 9.69/5.26 | 0.221/0.266 | 14.5/13.3 | 0.241/0.449