Author Contributions
Conceptualization, X.Z.; methodology, X.Z. and C.S.; software, X.Z.; validation, X.Z. and C.S.; formal analysis, X.Z. and C.S.; investigation, X.Z. and C.S.; resources, J.T. and J.L.; data curation, X.Z. and C.S.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., C.S., H.C. and C.W.; visualization, X.Z.; supervision, J.T. and J.L.; project administration, C.S.; funding acquisition, J.L., J.T., C.S., C.W. and H.C. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Various spectrograms for actual sound field situations: (a) clean spectrum; (b) with weak noise; (c) with strong noise; (d) with reverberation; (e) with weak noise and reverberation; (f) with strong noise and reverberation; (g) noise spectrum at low frequency; (h) speaker-like noise spectrum; (i) noise spectrum across frequencies.
Figure 2.
The architecture of HuBERT and WavLM.
Figure 3.
The pipeline of PNR. The PNR processes contaminated speech, which is encoded sequentially through CNN and HuBERT encoders. The representation is then transformed into output by a projection layer.
Figure 4.
The PNR-TDNN pipeline and the specifics of the proposed TDNN architecture are as follows: (a) depicts the conversion of speech to loss; (b) outlines the structure of the fusion Res2Net within the TDNN module; (c,c1) detail the dimensionality of the features in the SE-attentive mean module; (d) illustrates the dimension transformation of features in the multi-head attentive statistics pooling module, while (d1) provides the computation details of the pooling factor .
Figure 5.
The algorithm of Canopy/Mini Batch k-means++.
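The clustering stage named in Figure 5 can be sketched minimally as mini-batch k-means with k-means++ seeding. The Canopy pre-clustering step is omitted here, and every name below is illustrative rather than the paper's actual code.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the existing centers."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.stack(centers)

def mini_batch_kmeans(X, k, batch_size=64, n_iter=100, seed=0):
    """Mini-batch k-means: centers are updated from small random
    batches with a per-center decaying learning rate."""
    rng = np.random.default_rng(seed)
    centers = kmeanspp_init(X, k, rng)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.integers(len(X), size=batch_size)]
        # assign each batch point to its nearest center
        assign = ((batch[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for i, x in zip(assign, batch):
            counts[i] += 1
            lr = 1.0 / counts[i]  # decays as the center sees more points
            centers[i] = (1 - lr) * centers[i] + lr * x
    return centers

# Toy usage: recover three well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in ((0, 0), (3, 3), (6, 0))])
centers = mini_batch_kmeans(X, k=3)
```

In the HuBERT pipeline such cluster assignments serve as pseudo-labels; the mini-batch variant keeps this tractable when the feature set is far too large for full-batch k-means.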
Figure 6.
Comparison of different Res2Net variants.
Figure 7.
Conference room for noise acquisition.
Figure 8.
Noise acquisition equipment.
Figure 9.
Number of clusters corresponding to and .
Figure 10.
Convergence of HuBERT/WavLM.
Figure 11.
Metrics of different PNR-TDNNs in dataset A/B/C: (a) EER in dataset A; (b) EER in dataset B; (c) EER in dataset C; (d) MinDCF in dataset A; (e) MinDCF in dataset B; (f) MinDCF in dataset C.
Figure 12.
Histogram of significant differences between the four baseline models and the T-Hu+EC+FAM model on dataset A (labels a, b, c, d, and e denote the groups corresponding to the proposed T-Hu+EC+FAM, Gemini, RedimNet, CAM++, and ECAPA-TDNN, respectively).
Figure 13.
Best EERs of the four baseline models and the T-Hu + EC + FAM model (blue for dataset A, red for dataset B, and gray for dataset C).
Figure 14.
Best EERs of the four baseline models and the T-Hu+EC+FAM model (blue for cnceleb_v2-B, red for VoxCeleb1-B).
Table 1.
Parameters of base HuBERT.
| CNN Encoder (Seven Convolutional Layers) | Transformer Encoders | Projection Layer |
|---|---|---|
| strides [5, 2, 2, 2, 2, 2, 2] | layers [12] | dimension [256] |
| kernels [10, 3, 3, 3, 3, 2, 2] | attention heads [12] | |
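As a quick sanity check on Table 1, the product of the CNN encoder strides gives the feature extractor's downsampling factor. The sketch below assumes the standard base-HuBERT configuration of seven layers with strides [5, 2, 2, 2, 2, 2, 2] and 16 kHz input.

```python
from math import prod

# Strides of the seven convolutional layers in the base-HuBERT
# feature extractor (standard configuration, assumed here).
strides = [5, 2, 2, 2, 2, 2, 2]

# Total downsampling: one output frame per `total_stride` waveform samples.
total_stride = prod(strides)

# At a 16 kHz sampling rate this yields one frame every 20 ms.
sample_rate = 16_000
frame_shift_ms = 1000 * total_stride / sample_rate
print(total_stride, frame_shift_ms)  # 320 20.0
```

So the CNN encoder emits 50 representation frames per second, which is the rate at which the transformer layers and, downstream, the pseudo-label clustering operate.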
Table 2.
Reverb configuration (u: uniform distribution).
| Room/m | L (length) | u(5, 10) | Source/m | L | L/2 + u(−0.2, 0.2) |
|---|---|---|---|---|---|
| | W (width) | u(5, 10) | | W | W/2 + u(−0.2, 0.2) |
| | H (height) | u(3, 4) | | H | u(0.9, 1.8) |
| | low | u(0.1, 0.5) | Microphone/m | L | L/2 + u(−1.6, −0.8) or u(0.8, 1.6) |
| | middle | u(0.5, 1) | | W | W/2 + u(−1.6, −0.8) or u(0.8, 1.6) |
| | high | u(1, 1.5) | | H | u(5, 10) |
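One draw from Table 2's configuration could be generated as sketched below. All names are illustrative, not the paper's simulator; note that the microphone height row reads u(5, 10) in the table, which would exceed the sampled room height of u(3, 4), so the sketch assumes the source height range u(0.9, 1.8) instead.

```python
import random

def u(a, b):
    """The uniform distribution u(a, b) used throughout Table 2."""
    return random.uniform(a, b)

def sample_reverb_config():
    """Draw one room/source/microphone layout following Table 2 (a sketch)."""
    L, W, H = u(5, 10), u(5, 10), u(3, 4)          # room size in metres
    rt60 = random.choice([u(0.1, 0.5),             # low
                          u(0.5, 1),               # middle
                          u(1, 1.5)])              # high
    source = (L / 2 + u(-0.2, 0.2),                # near the room centre
              W / 2 + u(-0.2, 0.2),
              u(0.9, 1.8))
    mic = (L / 2 + random.choice([u(-1.6, -0.8), u(0.8, 1.6)]),
           W / 2 + random.choice([u(-1.6, -0.8), u(0.8, 1.6)]),
           u(0.9, 1.8))                            # assumed; see note above
    return (L, W, H), rt60, source, mic

room, rt60, src, mic = sample_reverb_config()
```

The source sits near the room centre while the microphone is offset by 0.8 to 1.6 m to either side, which guarantees a nonzero source-microphone distance in every draw.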
Table 3.
Datasets for speaker verification (percentage: proportion of the selected data relative to the original unlabeled speech; “✓” denotes inclusion; noise sources: Soundsnap and actual recordings).
| | Percentage (%) | Noise (SNR: ) | Reverb | Number of Speakers | Training Set Amount | Test Set Amount |
|---|---|---|---|---|---|---|
| dataset A | 6.05 | ✓ | | 877 | 66,000 | 855 |
| dataset B | 4.67 | ✓ | ✓ | 696 | 51,000 | 642 |
| dataset C | 2.75 | ✓ | ✓ | 422 | 30,000 | 359 |
Table 4.
Best EER and MinDCF of the baseline models.
| Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF | Params (M) | Flops (G) | RTF (×10⁻³) |
|---|---|---|---|---|---|---|---|---|---|
| TDNN | 7.52 | 0.201 | 9.47 | 0.192 | 7.35 | 0.155 | 11.3 | 0.318 | 0.47 |
| Resnet-SE | 8.40 | 0.184 | 8.72 | 0.188 | 9.73 | 0.159 | 56.1 | 3.719 | 11.56 |
| Res2net | 10.26 | 0.204 | 10.77 | 0.195 | 9.78 | 0.169 | 111.6 | 1.002 | 2.97 |
| Eres2net | 7.61 | 0.196 | 7.89 | 0.189 | 6.65 | 0.160 | 27.4 | 2.428 | 7.18 |
| ECAPA-TDNN [20] | 7.18 | 0.188 | 8.44 | 0.190 | 6.77 | 0.154 | 33.9 | 0.973 | 1.23 |
| CAM++ [39] | 5.97 | 0.178 | 7.81 | 0.191 | 6.35 | 0.159 | 28.7 | 0.813 | 4.63 |
| MFA-Conformer [58] | 7.98 | 0.198 | 8.78 | 0.189 | 7.40 | 0.160 | 77.0 | 0.994 | 2.48 |
| Wespeaker [59] | 10.88 | 0.206 | 12.12 | 0.197 | 10.22 | 0.174 | 0.96 | 0.006 | 0.19 |
| RedimNet [60] | 6.59 | 0.197 | 8.19 | 0.192 | 7.45 | 0.166 | 21.1 | 1.290 | 8.79 |
| Gemini [61] | 7.78 | 0.201 | 8.43 | 0.190 | 8.12 | 0.163 | 26.7 | 3.834 | 26.7 |
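The EER values reported in Tables 4–11 can be reproduced from raw trial scores with the standard threshold sweep below. This is a generic sketch, not the authors' scoring script; MinDCF comes from the same sweep by weighting the miss and false-alarm rates with the evaluation prior and costs.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-acceptance
    and false-rejection rates cross as the threshold sweeps the scores."""
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        far = np.mean(nontarget_scores >= t)  # false-acceptance rate
        frr = np.mean(target_scores < t)      # false-rejection rate
        if frr >= far:
            return (far + frr) / 2
    return 0.5

# Toy trial lists: higher score = more likely the same speaker.
tgt = np.array([0.9, 0.8, 0.7, 0.4])
non = np.array([0.5, 0.3, 0.2, 0.1])
print(eer(tgt, non))  # 0.25, i.e. 25% EER
```

In practice the scores are cosine similarities between enrollment and test embeddings, and the sweep is computed over the full trial list of each dataset.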
Table 5.
PNR-TDNN’s best EER and MinDCF for different numbers of clusters (Note: Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, EC = ECAPA-TDNN, Wa = WavLM, CA = CAM++).
| Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF | Params (M) | Flops (G) | RTF (×10⁻³) |
|---|---|---|---|---|---|---|---|---|---|
| Hu + EC(60) | 6.41 | 0.189 | 8.51 | 0.196 | 7.56 | 0.161 | 405.1 | 16.11 | 23.50 |
| Hu + EC(70) | 6.39 | 0.190 | 8.10 | 0.195 | 6.84 | 0.161 | 405.1 | 16.11 | 23.50 |
| Hu + EC(80) | 6.35 | 0.186 | 7.84 | 0.194 | 6.01 | 0.158 | 405.1 | 16.11 | 23.50 |
| Hu + EC(90) | 6.58 | 0.195 | 8.37 | 0.197 | 7.35 | 0.164 | 405.1 | 16.11 | 23.50 |
| Hu + EC(100) | 12.69 | 0.208 | 9.87 | 0.201 | 8.81 | 0.172 | 405.1 | 16.11 | 23.50 |
| Hu + EC(110) | 25.78 | 0.832 | 29.31 | 0.978 | 24.69 | 0.728 | 405.1 | 16.11 | 23.50 |
| Hu + EC(120) | 32.66 | 1.235 | 37.57 | 1.424 | 33.71 | 1.354 | 405.1 | 16.11 | 23.50 |
| Wa + EC(80) | 6.53 | 0.193 | 9.21 | 0.194 | 7.56 | 0.162 | 405.1 | 16.11 | 25.20 |
| Hu + CA(80) | 6.33 | 0.192 | 7.99 | 0.199 | 7.16 | 0.166 | 404.6 | 16.24 | 26.80 |
| S-Hu + EC(80) | 5.78 | 0.196 | 7.34 | 0.191 | 5.94 | 0.172 | 140.6 | 9.55 | 15.67 |
| T-Hu + EC(80) | 5.70 | 0.186 | 8.05 | 0.185 | 6.06 | 0.156 | 90.2 | 8.46 | 13.32 |
Table 6.
Parameters of compressed HuBERT.
| | Transformer Encoders | Feed Forward Net |
|---|---|---|
| base HuBERT | layers: 12, heads: 12 | dim: 3072 |
| small HuBERT | layers: 4, heads: 12 | dim: 1536 |
| tiny HuBERT | layers: 2, heads: 12 | dim: 512 |
Table 7.
PNR-TDNN’s best EER and MinDCF under multiple enhancement methods (Note: Hu = HuBERT, EC = ECAPA-TDNN, S-Hu = small HuBERT, T-Hu = tiny HuBERT, F = fusion Res2net, A = SE-attentive mean, M = multi-head attentive statistics pooling).
| | Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF |
|---|---|---|---|---|---|---|---|
| 1 | Hu + EC (original Res2net) | 6.38 | 0.188 | 8.37 | 0.193 | 6.17 | 0.157 |
| 2 | Hu + EC (reverse Res2net) | 6.51 | 0.188 | 8.11 | 0.190 | 5.97 | 0.163 |
| 3 | Hu + EC + F | 5.82 | 0.181 | 8.09 | 0.192 | 5.93 | 0.161 |
| 4 | Hu + EC + A | 6.14 | 0.180 | 7.76 | 0.191 | 5.38 | 0.165 |
| 5 | Hu + EC + M | 5.82 | 0.183 | 7.67 | 0.191 | 5.71 | 0.158 |
| 6 | Hu + EC + AM | 5.88 | 0.182 | 7.58 | 0.192 | 4.87 | 0.155 |
| 7 | Hu + EC + FAM | 5.65 | 0.178 | 7.47 | 0.190 | 4.84 | 0.156 |
| 8 | S-Hu + EC + FAM | 5.40 | 0.180 | 6.94 | 0.190 | 5.65 | 0.161 |
| 9 | T-Hu + EC + FAM | 5.66 | 0.182 | 7.52 | 0.185 | 5.74 | 0.153 |
Table 8.
PNR-TDNN’s best EER and MinDCF under ESC50 and CnCeleb noise conditions (Note: EC = ECAPA-TDNN, Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, FAM = fusion Res2net + SE-attentive mean + multi-head attentive statistics pooling).
| Noise Sources | Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF |
|---|---|---|---|---|---|---|---|
| ESC50/CnCeleb | EC | 9.42/8.89 | 0.200/0.203 | 9.41/9.45 | 0.193/0.195 | 7.31/7.08 | 0.170/0.170 |
| | CAM++ | 8.21/7.68 | 0.197/0.198 | 8.83/10.43 | 0.193/0.197 | 8.58/8.30 | 0.174/0.177 |
| | MFA-Conformer | 8.95/10.52 | 0.204/0.203 | 9.91/9.81 | 0.194/0.192 | 6.96/7.03 | 0.171/0.167 |
| | Wespeaker | 13.3/13.3 | 0.209/0.207 | 13.2/13.6 | 0.197/0.197 | 11.0/11.8 | 0.176/0.174 |
| | RedimNet | 9.11/8.39 | 0.209/0.208 | 8.67/8.89 | 0.194/0.195 | 10.63/9.15 | 0.176/0.171 |
| | Gemini | 8.76/9.14 | 0.207/0.205 | 9.37/9.70 | 0.194/0.194 | 8.69/9.36 | 0.175/0.175 |
| | Hu + EC | 7.61/7.43 | 0.199/0.195 | 8.35/8.95 | 0.191/0.194 | 7.11/6.84 | 0.160/0.160 |
| | Hu + EC + FAM | 7.41/7.27 | 0.196/0.192 | 8.12/8.71 | 0.190/0.190 | 6.80/6.63 | 0.162/0.156 |
| | S-Hu + EC + FAM | 5.87/5.79 | 0.191/0.188 | 7.27/6.98 | 0.187/0.185 | 5.25/4.19 | 0.155/0.146 |
| | T-Hu + EC + FAM | 6.26/6.44 | 0.200/0.190 | 8.35/7.63 | 0.188/0.185 | 6.51/5.39 | 0.168/0.145 |
Table 9.
PNR-TDNN’s best EER and MinDCF over a wide SNR range (−5 to 3 dB) (Note: EC = ECAPA-TDNN, Hu = HuBERT, S-Hu = small HuBERT, T-Hu = tiny HuBERT, FAM = fusion Res2net + SE-attentive mean + multi-head attentive statistics pooling).
| Methodology | Dataset A EER (%) | Dataset A MinDCF | Dataset B EER (%) | Dataset B MinDCF | Dataset C EER (%) | Dataset C MinDCF |
|---|---|---|---|---|---|---|
| EC | 8.72 | 0.201 | 9.59 | 0.188 | 6.44 | 0.157 |
| CAM++ | 7.43 | 0.196 | 8.62 | 0.192 | 17.22 | 0.334 |
| MFA-Conformer | 9.25 | 0.203 | 9.32 | 0.188 | 6.99 | 0.161 |
| Wespeaker | 12.43 | 0.208 | 12.04 | 0.195 | 10.26 | 0.174 |
| RedimNet | 7.97 | 0.206 | 8.39 | 0.187 | 6.53 | 0.160 |
| Gemini | 7.77 | 0.202 | 8.51 | 0.191 | 8.86 | 0.166 |
| Hu + EC | 6.65 | 0.197 | 8.57 | 0.191 | 6.88 | 0.159 |
| Hu + EC + FAM | 6.49 | 0.196 | 7.95 | 0.185 | 6.13 | 0.154 |
| S-Hu + EC + FAM | 6.73 | 0.195 | 8.48 | 0.189 | 6.36 | 0.162 |
| T-Hu + EC + FAM | 7.20 | 0.202 | 8.62 | 0.191 | 7.02 | 0.163 |
Table 10.
VoxCeleb1-A/B and cn-celeb_v2-A/B setup (percentage: proportion relative to the original unlabeled speech, VoxCeleb1-S or cn-celeb_v2-S).
| | Percentage (%) | Training Set Amount | Test Set Amount |
|---|---|---|---|
| VoxCeleb1-A | 8 | 8098 | 142 |
| VoxCeleb1-B | 40.2 | 40,398 | 478 |
| cn-celeb_v2-A | 14.5 | 8035 | 155 |
| cn-celeb_v2-B | 49 | 26,618 | 570 |
Table 11.
PNR-TDNN’s best EER and MinDCF on other public datasets (Note: EC = ECAPA-TDNN, S-Hu = small HuBERT, T-Hu = tiny HuBERT, Hu = HuBERT, FAM = fusion Res2net + SE-attentive mean + multi-head attentive statistics pooling).
| Methodology | VoxCeleb1-A/B (Noise) EER (%) | MinDCF | VoxCeleb1-A/B (Noise and Reverb) EER (%) | MinDCF | cnceleb_v2-A/B (Noise) EER (%) | MinDCF | cnceleb_v2-A/B (Noise and Reverb) EER (%) | MinDCF |
|---|---|---|---|---|---|---|---|---|
| EC | 6.87/13.5 | 0.250/0.331 | 18.0/23.3 | 0.308/0.509 | 12.6/8.68 | 0.201/0.334 | 14.1/14.0 | 0.241/0.438 |
| CAM++ | 15.9/15.5 | 0.306/0.409 | 34.7/59.7 | 0.403/0.930 | 13.8/7.52 | 0.235/0.310 | 16.2/13.6 | 0.241/0.426 |
| MFA-Conformer | 17.1/15.5 | 0.329/0.421 | 19.3/24.2 | 0.356/0.573 | 13.4/11.9 | 0.241/0.504 | 15.8/18.7 | 0.255/0.560 |
| Wespeaker | 13.9/25.7 | 0.319/0.582 | 22.2/34.7 | 0.351/0.624 | 16.6/20.3 | 0.246/0.768 | 18.0/27.3 | 0.270/0.818 |
| RedimNet | 13.9/15.1 | 0.306/0.412 | 16.4/25.6 | 0.335/0.580 | 14.8/10.3 | 0.236/0.369 | 17.2/16.5 | 0.241/0.579 |
| Gemini | 17.0/18.9 | 0.351/0.468 | 20.5/22.4 | 0.329/0.545 | 16.7/12.1 | 0.255/0.493 | 19.9/16.9 | 0.266/0.620 |
| Hu + EC | 4.62/9.96 | 0.247/0.212 | 17.1/21.7 | 0.304/0.536 | 12.2/7.15 | 0.203/0.309 | 14.2/13.2 | 0.244/0.445 |
| Hu + EC + FAM | 3.30/9.27 | 0.208/0.205 | 15.8/20.9 | 0.296/0.523 | 11.9/6.73 | 0.197/0.281 | 13.8/12.8 | 0.240/0.414 |
| S-Hu + EC + FAM | 3.06/9.78 | 0.195/0.275 | 15.0/19.2 | 0.307/0.505 | 10.9/4.87 | 0.240/0.258 | 12.9/11.9 | 0.233/0.408 |
| T-Hu + EC + FAM | 4.39/9.19 | 0.219/0.244 | 16.5/20.5 | 0.313/0.542 | 9.69/5.26 | 0.221/0.266 | 14.5/13.3 | 0.241/0.449 |