Figure 1. EEND-EDA architecture overview showing alternative feature configurations. The model processes input audio through parallel streams: either standalone Mel-filterbanks or a concatenated representation combining Mel-filterbanks with auxiliary features. The SA-EEND encoder generates embeddings, which the EDA module transforms into speaker-specific attractors for the final diarization output.
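As a rough illustration of the two input streams in Figure 1, the sketch below concatenates Mel-filterbank frames with frame-aligned auxiliary features along the feature axis. The function name, the dimensions, and the assumption that the auxiliary stream is already frame-aligned are ours, not taken from the paper.

```python
import numpy as np

def build_input_features(mel_fbank, aux=None):
    """Frame-level input for the SA-EEND encoder.

    mel_fbank: (T, D_mfb) log Mel-filterbank frames.
    aux:       (T, D_aux) frame-aligned auxiliary features (e.g., GeMAPS LLDs
               or per-frame ECAPA-TDNN embeddings), or None for the MFB-only stream.
    """
    if aux is None:
        return mel_fbank                              # standalone MFB configuration
    assert aux.shape[0] == mel_fbank.shape[0], "streams must be frame-aligned"
    return np.concatenate([mel_fbank, aux], axis=-1)  # MFB ⊕ auxiliary features

# Illustrative shapes: 500 frames of 23-dim MFB + 62-dim GeMAPS LLDs -> (500, 85)
x = build_input_features(np.random.randn(500, 23), np.random.randn(500, 62))
```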
Figure 2. Distribution of audio duration in hours by number of speakers for CallHome training (CH1) and testing (CH2) datasets.
Figure 3. Results obtained on the CH2-2spk dataset using different features. Shaded regions indicate experiments where the corresponding features were concatenated with MFB features. For each feature type, the bars show the DER with and without oracle VAD, as well as with and without domain adaptation (striped bars: with domain adaptation; solid bars: without).
Figure 4. Performance degradation, measured as DER (%), when the mean of each feature group is computed per audio file during inference.
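A minimal sketch of the ablation behind Figure 4, assuming features are stored as a (frames × dimensions) matrix and that "computing the mean of each feature group" means replacing that group's columns with their per-file average at inference time; the function name and column indices are illustrative.

```python
import numpy as np

def mean_ablate_group(features, group_cols):
    """Replace one feature group's columns with their per-file mean so the group
    carries no frame-level (temporal) information during inference."""
    ablated = features.copy()
    ablated[:, group_cols] = features[:, group_cols].mean(axis=0, keepdims=True)
    return ablated

# Illustrative: neutralize a hypothetical 4-column group of a 62-dim GeMAPS stream
x = np.random.randn(1000, 62)
x_ablated = mean_ablate_group(x, group_cols=[10, 11, 12, 13])
```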
Figure 5. Performance comparison of different GeMAPS subsets and context sizes on the CH2 2-spk set, with adaptation on CH1 2-spk set. Three feature configurations are compared: Complete GeMAPS, F0 + Formants, and a Reduced Set, across varying context sizes from 0 to 8.
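Here, "context size" is interpreted as splicing each GeMAPS frame with its ±c neighbors before concatenation with MFB; this is an assumption about the setup rather than the authors' exact implementation.

```python
import numpy as np

def splice_context(features, context):
    """Concatenate each frame with its +/- `context` neighbors (edges padded by
    repetition): (T, D) -> (T, D * (2 * context + 1))."""
    if context == 0:
        return features
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    shifted = [padded[i:i + features.shape[0]] for i in range(2 * context + 1)]
    return np.concatenate(shifted, axis=-1)

# Illustrative: context size 8 on 62-dim GeMAPS frames -> 62 * 17 = 1054 dims per frame
y = splice_context(np.random.randn(1000, 62), context=8)
```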
Figure 6. Performance comparison across feature configurations when adapting to different CH1 subsets. Radar plots show multiple evaluation metrics for: (a) MFB baseline, (b) ECAPA-TDNN ⊕ MFB, and (c) GeMAPS ⊕ MFB. Each plot compares adaptation using the complete CH1 all set (2–7 speakers) versus using only conversations with 2–4 speakers. Beside each metric, arrows indicate whether lower (↓) or higher (↑) values are better. Values in parentheses show the metric scores for (CH1 all, CH1-2–4spk) respectively. Numerical ranges shown in brackets [min–max] for each axis are individually adjusted per metric to highlight relative differences between adaptation strategies. Metrics displayed include DER, SCE, Speaker Count Accuracy, Turn Detection F1, and OSD F1.
Figure 7. Impact of dropout rate on DER performance across different feature configurations. Results compare adaptation using complete CH1 all (2–7 speakers) versus restricted set (2–4 speakers). Each subplot shows a different feature configuration: (left) MFB baseline features, (middle) ECAPA-TDNN ⊕ MFB features, and (right) GeMAPS ⊕ MFB features.
Figure 8. Impact of weighted batch sampling on speaker counting performance across different feature configurations and adaptation strategies. Results compare standard sampling versus weighted sampling for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Top: Speaker Count Accuracy (%). Bottom: Speaker Count Error.
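One possible realization of the weighted batch sampling examined in Figure 8 is to draw recordings with probability inversely proportional to the frequency of their speaker count, for example with PyTorch's WeightedRandomSampler. The weighting scheme and helper below are illustrative, not the paper's exact recipe.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def make_speaker_count_sampler(speakers_per_recording):
    """Sample recordings with probability inversely proportional to how common their
    speaker count is, so rare counts (5-7 speakers) appear more often per epoch."""
    freq = Counter(speakers_per_recording)
    weights = torch.tensor([1.0 / freq[n] for n in speakers_per_recording],
                           dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# CH1 distribution from Table 1: 155x2, 61x3, 23x4, 5x5, 3x6, 2x7 speakers
spk_counts = [2] * 155 + [3] * 61 + [4] * 23 + [5] * 5 + [6] * 3 + [7] * 2
sampler = make_speaker_count_sampler(spk_counts)
# loader = DataLoader(adaptation_set, batch_size=..., sampler=sampler)  # hypothetical dataset
```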
Figure 9. Impact of label smoothing and subsampling rate applied during adaptation on system performance across different feature configurations. Top row shows DER (%) variations; bottom row shows Speaker Turn Detection F1-score changes. Each column represents a different feature configuration.
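A minimal sketch of the label smoothing studied in Figure 9, assuming symmetric smoothing of the frame-level binary speaker-activity targets toward 0.5 before the binary cross-entropy loss; the smoothing form and the ε value are assumptions.

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, epsilon=0.1):
    """Soften hard 0/1 speaker-activity targets toward 0.5:
    1 -> 1 - epsilon/2 and 0 -> epsilon/2, leaving the BCE loss otherwise unchanged."""
    return targets * (1.0 - epsilon) + 0.5 * epsilon

# Illustrative (batch, frames, speakers) activity matrix and logits
labels = torch.randint(0, 2, (8, 500, 3)).float()
logits = torch.randn(8, 500, 3)
loss = F.binary_cross_entropy_with_logits(logits, smooth_labels(labels, epsilon=0.1))
```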
Figure 10. Performance gains from enhanced temporal resolution across feature configurations, with consistent regularization (dropout = 0.3) and adaptation strategy (CH1 2–4 spk). Left: Diarization Error Rate (DER), where lower values indicate better performance. Right: Speaker Turn Detection F1-score, where higher values indicate better performance. Green percentages indicate relative improvements.
Table 1. Distribution of audio recordings by number of speakers in the CallHome database for training (CH1) and testing (CH2) datasets. The columns represent the number of speakers per recording, while values indicate the total count of audio files in each category.

| Dataset | 2 spk | 3 spk | 4 spk | 5 spk | 6 spk | 7 spk | Total |
|---|---|---|---|---|---|---|---|
| CH1 (Train) | 155 | 61 | 23 | 5 | 3 | 2 | 249 |
| CH2 (Test) | 148 | 74 | 20 | 5 | 3 | 0 | 250 |
Table 2. Diarization Error Rate (DER) components for the best feature configurations. For ECAPA-TDNN concatenated with MFB features, oracle VAD was employed during both training and inference phases. Models are adapted to CH1-2spk and evaluated on CH2-2spk.

| Feature Type | Conditions | Adaptation | Miss (%) | FA (%) | Spk. Error (%) | Total DER (%) |
|---|---|---|---|---|---|---|
| MFB (baseline) | No VAD | ✗ | 2.50 | 6.60 | 0.99 | 10.09 |
| MFB (baseline) | No VAD | ✓ | 3.40 | 3.87 | 0.80 | 8.07 |
| ECAPA-TDNN ⊕ MFB | Oracle VAD | ✗ | 3.55 | 3.10 | 3.21 | 9.86 |
| ECAPA-TDNN ⊕ MFB | Oracle VAD | ✓ | 3.0 | 3.05 | 1.15 | 7.20 |
| GeMAPS ⊕ MFB | No VAD | ✗ | 3.50 | 5.70 | 1.55 | 10.75 |
| GeMAPS ⊕ MFB | No VAD | ✓ | 5.74 | 2.40 | 0.90 | 9.04 |
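For reference, the three components in Table 2 add up to the total, following the standard DER definition:

$$\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}} \times 100,$$

where $T_{\mathrm{miss}}$, $T_{\mathrm{FA}}$, and $T_{\mathrm{conf}}$ are the missed-speech, false-alarm, and speaker-confusion durations and $T_{\mathrm{speech}}$ is the total reference speech time; for the adapted ECAPA-TDNN ⊕ MFB row, 3.0 + 3.05 + 1.15 = 7.20.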
Table 3. Performance comparison of different GeMAPS feature configurations on CH2-2spk. The Proposed Reduced Set excludes voiced segments and jitter/shimmer features based on the ablation study results. ΔDER denotes the absolute change in DER relative to the complete GeMAPS set.

| Configuration | Number of Features | DER (%) | ΔDER (%) |
|---|---|---|---|
| Complete GeMAPS | 62 | 13.60 | – |
| F0 + Formants | 24 | 45.33 | +31.73 |
| Proposed Reduced Set | 52 | 13.93 | +0.33 |
Table 4. Performance comparison between the original baseline [36] and our implementation on the CH2 evaluation set. Results are shown for different training configurations: two-speaker simulated conversations (SC 2 spkr), one-to-four speaker simulated conversations (SC 1–4 spkr), and after adaptation using the complete CH1 set (CH1 all). DER (%) is reported for the complete evaluation set (All) and broken down by the number of speakers in each recording.

| System | Training Set | All | 2 spk | 3 spk | 4 spk | 5 spk | 6 spk |
|---|---|---|---|---|---|---|---|
| Baseline | SC 2 spkr | 20.86 | 8.48 | 21.07 | 29.56 | 45.61 | 49.2 |
| Baseline | SC 1–4 spkr | 16.18 | 8.95 | 13.78 | 21.22 | 37.35 | 46.32 |
| Baseline | CH1 all | 16.07 | 10.03 | 14.35 | 19.3 | 30.67 | 46.94 |
| Baseline (ours) | SC 2 spkr | 22.24 | 10.33 | 22.71 | 30.64 | 49.18 | 48.65 |
| Baseline (ours) | SC 1–4 spkr | 18.28 | 9.91 | 16.47 | 24.29 | 45.88 | 46.35 |
| Baseline (ours) | CH1 all | 16.8 | 7.99 | 15.42 | 23.47 | 37.17 | 41.07 |
Table 5. Performance metrics for our baseline implementation after CH1 adaptation on the CH2 evaluation set. Results demonstrate the system's capabilities across different aspects of the diarization task.

| Task | Metric | Performance |
|---|---|---|
| Speaker Count | Accuracy (%) | 73.20 |
| Speaker Count | Speaker Counting Error | 0.348 |
| Speaker Turn Detection | Precision (%) | 62.41 |
| Speaker Turn Detection | Recall (%) | 46.31 |
| Speaker Turn Detection | F1-score (%) | 53.16 |
| Overlap Speech Detection | Precision (%) | 64.90 |
| Overlap Speech Detection | Recall (%) | 42.40 |
| Overlap Speech Detection | F1-score (%) | 51.30 |
| Overlap Speech Detection | Accuracy (%) | 87.95 |
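A short sketch of how the two speaker-counting figures in Table 5 can be computed, assuming the Speaker Counting Error is the mean absolute difference between predicted and reference speaker counts per recording (the paper's exact definition may differ):

```python
import numpy as np

def speaker_counting_metrics(pred_counts, ref_counts):
    """Per-recording speaker-count accuracy and a simple Speaker Counting Error,
    taken here as the mean absolute difference between predicted and true counts."""
    pred = np.asarray(pred_counts)
    ref = np.asarray(ref_counts)
    accuracy = 100.0 * np.mean(pred == ref)
    sce = np.mean(np.abs(pred - ref))
    return accuracy, sce

# Example: 5 recordings, one over-count and one under-count
acc, sce = speaker_counting_metrics([2, 3, 3, 4, 2], [2, 3, 2, 5, 2])
print(f"Accuracy = {acc:.1f}%  SCE = {sce:.3f}")   # Accuracy = 60.0%  SCE = 0.400
```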
Table 6. DER (%) performance comparison on the CH2 evaluation set across different feature configurations at the training stage. Results are broken down by the number of speakers in each recording.

| Feature Configuration | Training Set | All | 2 spk | 3 spk | 4 spk | 5 spk | 6 spk |
|---|---|---|---|---|---|---|---|
| Baseline (ours) | SC 2 spkr | 22.24 | 10.33 | 22.71 | 30.64 | 49.18 | 48.65 |
| Baseline (ours) | SC 1–4 spkr | 18.28 | 9.91 | 16.47 | 24.29 | 45.88 | 46.35 |
| Baseline (ours) | CH1 all | 16.8 | 7.99 | 15.42 | 23.47 | 37.17 | 41.07 |
| ECAPA-TDNN ⊕ MFB | SC 2 spkr | 26.66 | 12.31 | 26.12 | 38.41 | 44.22 | 57.37 |
| ECAPA-TDNN ⊕ MFB | SC 1–4 spkr | 27.41 | 14.6 | 24.24 | 41.09 | 52.91 | 54.82 |
| ECAPA-TDNN ⊕ MFB | CH1 all | 20.95 | 10.45 | 19.00 | 31.79 | 38.44 | 51.32 |
| GeMAPS ⊕ MFB | SC 2 spkr | 24.90 | 10.75 | 27.45 | 36.39 | 48.50 | 54.02 |
| GeMAPS ⊕ MFB | SC 1–4 spkr | 22.91 | 11.01 | 23.99 | 30.56 | 45.23 | 49.15 |
| GeMAPS ⊕ MFB | CH1 all | 23.21 | 14.34 | 22.49 | 29.38 | 42.27 | 49.44 |
Table 7. Comprehensive performance metrics comparison across different feature configurations on the CH2 evaluation set after CH1 adaptation. Results show that while feature combinations improve certain aspects of performance, they introduce different trade-offs in speaker discrimination capabilities.

| Task | Metric | MFB | ECAPA-TDNN ⊕ MFB | GeMAPS ⊕ MFB |
|---|---|---|---|---|
| Speaker Count | Accuracy (%) | 73.20 | 64.00 | 66.40 |
| Speaker Count | SCE | 0.348 | 0.444 | 0.428 |
| Speaker Turn Det. | Precision (%) | 62.41 | 56.20 | 54.58 |
| Speaker Turn Det. | Recall (%) | 46.31 | 37.97 | 43.17 |
| Speaker Turn Det. | F1-score (%) | 53.16 | 45.32 | 48.21 |
| Overlap Speech Det. | Precision (%) | 64.90 | 42.50 | 55.80 |
| Overlap Speech Det. | Recall (%) | 42.40 | 45.60 | 36.30 |
| Overlap Speech Det. | F1-score (%) | 51.30 | 44.00 | 44.00 |
| Overlap Speech Det. | Accuracy (%) | 87.95 | 82.64 | 86.17 |
Table 8. Performance comparison of different dropout rates across multiple metrics for MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

| Adaptation | Dropout | SCE | Count Acc. | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 | OSD Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | 0.1 | 0.348 | 73.6 | 62.41 | 46.31 | 53.16 | 64.9 | 42.4 | 51.3 | 87.95 |
| CH1 all | 0.3 | 0.332 | 73.2 | 61.94 | 48.07 | 54.13 | 65.6 | 42.0 | 51.2 | 88.03 |
| CH1 all | 0.5 | 0.436 | 66.0 | 60.20 | 45.06 | 51.57 | 69.4 | 19.4 | 30.4 | 86.67 |
| CH1 2–4 spk | 0.1 | 0.320 | 74.4 | 62.63 | 47.51 | 53.97 | 64.4 | 44.5 | 52.6 | 88.02 |
| CH1 2–4 spk | 0.3 | 0.312 | 74.8 | 62.68 | 49.41 | 55.26 | 66.5 | 42.0 | 51.5 | 88.16 |
| CH1 2–4 spk | 0.5 | 0.364 | 68.0 | 59.36 | 47.70 | 52.85 | 61.8 | 25.9 | 36.5 | 86.52 |
Table 9. Performance comparison of different dropout rates across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

| Adaptation | Dropout | SCE | Count Acc. | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 | OSD Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | 0.1 | 0.444 | 64.0 | 56.20 | 37.97 | 45.32 | 42.5 | 45.6 | 44.0 | 82.64 |
| CH1 all | 0.3 | 0.384 | 70.0 | 58.86 | 38.69 | 46.69 | 43.3 | 45.1 | 44.2 | 82.96 |
| CH1 all | 0.5 | 0.592 | 57.2 | 55.33 | 36.10 | 43.69 | 41.2 | 44.3 | 42.7 | 81.32 |
| CH1 2–4 spk | 0.1 | 0.376 | 68.4 | 59.84 | 40.54 | 48.33 | 45.2 | 47.0 | 46.1 | 83.55 |
| CH1 2–4 spk | 0.3 | 0.384 | 68.8 | 60.41 | 41.06 | 48.89 | 50.5 | 41.3 | 45.4 | 85.16 |
| CH1 2–4 spk | 0.5 | 0.464 | 62.8 | 57.06 | 38.17 | 45.74 | 44.3 | 37.4 | 40.6 | 83.61 |
Table 10. Performance comparison of different dropout rates across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

| Adaptation | Dropout | SCE | Count Acc. | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 | OSD Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | 0.1 | 0.428 | 66.4 | 54.58 | 43.17 | 48.21 | 55.8 | 36.3 | 44.0 | 86.17 |
| CH1 all | 0.3 | 0.524 | 59.2 | 56.77 | 43.51 | 49.26 | 38.4 | 33.6 | 35.9 | 82.02 |
| CH1 all | 0.5 | 0.636 | 50.0 | 51.58 | 38.21 | 43.90 | 53.0 | 15.1 | 23.4 | 85.30 |
| CH1 2–4 spk | 0.1 | 0.415 | 67.3 | 53.40 | 42.12 | 47.09 | 56.2 | 37.9 | 45.3 | 86.29 |
| CH1 2–4 spk | 0.3 | 0.328 | 73.2 | 60.63 | 47.08 | 53.00 | 66.2 | 29.9 | 41.2 | 87.22 |
| CH1 2–4 spk | 0.5 | 0.396 | 65.2 | 55.74 | 45.97 | 50.39 | 61.0 | 21.6 | 31.9 | 86.21 |
Table 11. Performance comparison of weighted batch sampling across multiple metrics for MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

| Adaptation | Strategy | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | No Weight | 9.0 | 2.6 | 5.2 | 16.8 | 62.41 | 46.31 | 53.16 | 64.9 | 42.4 | 51.3 |
| CH1 all | Weighted | 7.7 | 3.4 | 5.8 | 16.93 | 57.85 | 45.30 | 50.81 | 62.0 | 42.7 | 50.6 |
| CH1 2–4 spk | No Weight | 8.0 | 2.7 | 4.9 | 15.64 | 62.63 | 47.51 | 53.97 | 64.4 | 44.5 | 52.6 |
| CH1 2–4 spk | Weighted | 11.1 | 2.2 | 5.8 | 19.08 | 62.24 | 46.98 | 53.51 | 65.1 | 31.4 | 42.3 |
Table 12. Performance comparison of weighted batch sampling across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

| Adaptation | Strategy | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | No Weight | 5.0 | 8.4 | 7.6 | 20.95 | 56.20 | 37.97 | 45.32 | 42.5 | 45.6 | 44.0 |
| CH1 all | Weighted | 8.2 | 7.7 | 8.0 | 23.86 | 55.45 | 36.88 | 44.30 | 40.5 | 44.6 | 42.5 |
| CH1 2–4 spk | No Weight | 5.3 | 6.3 | 5.5 | 17.09 | 59.84 | 40.54 | 48.33 | 45.2 | 47.0 | 46.1 |
| CH1 2–4 spk | Weighted | 5.3 | 6.2 | 5.3 | 16.87 | 59.91 | 40.56 | 48.37 | 45.5 | 47.5 | 46.5 |
Table 13. Performance comparison of weighted batch sampling across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

| Adaptation | Strategy | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CH1 all | No Weight | 10.0 | 5.2 | 8.0 | 23.21 | 54.58 | 43.17 | 48.21 | 55.8 | 36.3 | 44.0 |
| CH1 all | Weighted | 17.6 | 3.0 | 7.8 | 28.35 | 54.78 | 40.57 | 46.62 | 55.3 | 26.1 | 35.5 |
| CH1 2–4 spk | No Weight | 9.6 | 4.8 | 8.1 | 22.50 | 53.40 | 42.12 | 47.09 | 56.2 | 37.9 | 45.3 |
| CH1 2–4 spk | Weighted | 12.3 | 3.9 | 5.8 | 22.09 | 59.95 | 45.34 | 51.63 | 55.2 | 33.9 | 42.0 |
Table 14. Performance comparison of label smoothing and subsampling rate across multiple metrics for MFB features. Results are shown for the restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

| Strategy | Subsampling | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No Smooth | 10 | 8.0 | 2.7 | 4.9 | 15.64 | 62.63 | 47.51 | 53.97 | 64.4 | 44.5 | 52.6 |
| No Smooth | 5 | 8.3 | 2.6 | 5.1 | 16.0 | 63.5 | 48.0 | 54.7 | 64.8 | 45.0 | 53.0 |
| Smooth | 10 | 7.8 | 4.2 | 5.4 | 17.39 | 60.60 | 44.70 | 51.44 | 56.0 | 47.0 | 51.4 |
| Smooth | 5 | 8.0 | 2.8 | 5.7 | 16.43 | 62.49 | 46.97 | 53.64 | 63.1 | 44.9 | 52.5 |
Table 15. Performance comparison of label smoothing and subsampling rate across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for the restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

| Strategy | Subsampling | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No Smooth | 10 | 5.3 | 6.3 | 5.5 | 17.09 | 59.84 | 40.54 | 48.33 | 45.2 | 47.0 | 46.1 |
| No Smooth | 5 | 5.2 | 6.2 | 5.4 | 16.8 | 60.5 | 41.0 | 49.0 | 45.5 | 47.5 | 46.5 |
| Smooth | 10 | 6.9 | 7.4 | 5.7 | 19.96 | 58.69 | 38.19 | 46.27 | 42.3 | 43.4 | 42.8 |
| Smooth | 5 | 5.1 | 6.0 | 5.3 | 16.4 | 61.0 | 41.5 | 49.5 | 46.0 | 48.0 | 47.0 |
Table 16. Performance comparison of label smoothing and subsampling rate across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for the restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

| Strategy | Subsampling | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Turn P | Turn R | Turn F1 | OSD P | OSD R | OSD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No Smooth | 10 | 9.6 | 4.8 | 8.1 | 22.5 | 53.40 | 42.12 | 47.09 | 56.2 | 37.9 | 45.3 |
| No Smooth | 5 | 11.8 | 3.0 | 6.0 | 20.8 | 60.5 | 45.8 | 52.2 | 60.5 | 31.0 | 41.0 |
| Smooth | 10 | 11.4 | 3.3 | 5.8 | 20.52 | 59.93 | 45.01 | 51.41 | 59.9 | 33.5 | 42.9 |
| Smooth | 5 | 11.6 | 2.8 | 5.9 | 20.3 | 60.94 | 46.35 | 52.65 | 61.0 | 31.5 | 41.5 |
Table 17. Error component analysis for ECAPA-TDNN ⊕ MFB and GeMAPS ⊕ MFB with enhanced temporal resolution (subsampling = 5). All configurations use dropout = 0.3 and restricted adaptation (CH1 2–4 spk).

| Feature Configuration | Miss (%) | FA (%) | Speaker Error (%) | Total DER (%) |
|---|---|---|---|---|
| MFB | 8.9 | 2.1 | 4.7 | 15.76 |
| ECAPA-TDNN ⊕ MFB | 5.7 | 3.7 | 5.5 | 14.89 |
| GeMAPS ⊕ MFB | 12.3 | 1.6 | 5.0 | 18.87 |
Table 18. Performance comparison of optimal configurations in terms of DER (%) for each feature type. All configurations use restricted adaptation (CH1 2–4 spk), dropout = 0.3, and subsampling = 5.

| Feature Configuration | Miss (%) | FA (%) | Spk. Error (%) | DER (%) | Speaker Count Acc. (%) | Turn Det. F1 (%) | OSD F1 (%) |
|---|---|---|---|---|---|---|---|
| MFB | 8.0 | 2.7 | 4.9 | 15.64 | 74.4 | 53.97 | 52.6 |
| ECAPA-TDNN ⊕ MFB | 5.7 | 3.7 | 5.5 | 14.89 | 68.8 | 50.53 | 47.0 |
| GeMAPS ⊕ MFB | 12.3 | 1.6 | 5.0 | 18.87 | 73.2 | 52.65 | 41.5 |
Table 19. Performance comparison across adaptation enhancements for each feature configuration. Results show DER (%) and relative improvement compared to the standard configuration (CH1 all).

| Configuration | MFB | ECAPA-TDNN ⊕ MFB | GeMAPS ⊕ MFB |
|---|---|---|---|
| Standard (CH1 all) | 16.80 | 20.95 | 23.21 |
| Restricted adaptation (CH1 2–4 spk) | 15.64 (−6.9%) | 17.09 (−18.4%) | 22.50 (−3.1%) |
| Restricted adapt. + dropout (0.3) | 16.17 (−3.8%) | 15.80 (−24.6%) | 19.74 (−14.9%) |
| All optimizations (subsampling = 5) | 15.76 (−6.2%) | 14.89 (−29.0%) | 18.87 (−18.7%) |
| Full enhancement stack | 16.16 (−3.8%) | 15.00 (−28.4%) | 19.48 (−16.1%) |
Table 20. Performance comparison between our optimized systems and state-of-the-art approaches on the CH2 evaluation set. Results are shown for both collared (250 ms) and collar-free evaluation, applying different subsampling factors during inference for our proposed systems.

| System | Inference Setting | DER (%), Collar 250 ms | DER (%), No Collar |
|---|---|---|---|
| MFB | Subsampling = 10, Median = 11 | 15.72 | 26.06 |
| MFB | Subsampling = 5, Median = 5 | 16.76 | 23.95 |
| ECAPA-TDNN ⊕ MFB | Subsampling = 10, Median = 11 | 14.89 | 25.78 |
| ECAPA-TDNN ⊕ MFB | Subsampling = 5, Median = 5 | 15.59 | 22.69 |
| GeMAPS ⊕ MFB | Subsampling = 10, Median = 11 | 18.87 | 29.19 |
| GeMAPS ⊕ MFB | Subsampling = 5, Median = 5 | 23.25 | 26.94 |
| VBx [9] | – | 14.21 | 21.77 |
| AHC [9] | – | 17.64 | 25.61 |
| Pyannote [44] | – | – | 29.30 |
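The two inference settings in Table 20 differ in temporal resolution and median-filter length. Below is a sketch of the corresponding post-processing, assuming a 10 ms base frame shift (so subsampling = 10 gives 100 ms frames and subsampling = 5 gives 50 ms frames) and a per-speaker median filter applied to thresholded activity probabilities; the threshold and shapes are illustrative.

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_activity(probs, threshold=0.5, median=11):
    """Threshold per-speaker activity probabilities (T, S) and smooth each speaker
    track with a median filter of odd length `median` (in frames)."""
    decisions = (probs > threshold).astype(float)
    return np.stack([medfilt(decisions[:, s], kernel_size=median)
                     for s in range(decisions.shape[1])], axis=1)

# With 10x subsampling of 10 ms frames, median = 11 spans ~1.1 s of audio;
# with 5x subsampling (50 ms frames), median = 5 spans ~0.25 s.
smoothed = postprocess_activity(np.random.rand(600, 2), median=11)
```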