Feature Integration Strategies for Neural Speaker Diarization in Conversational Telephone Speech

Alvarez-Trejos, Juan Ignacio; Lozano-Diez, Alicia; Ramos, Daniel

doi:10.3390/app15094842


by Juan Ignacio Alvarez-Trejos †, Alicia Lozano-Diez † and Daniel Ramos *

AUDIAS, Electronic and Communication Technology Department, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Av. Francisco Tomás y Valiente, 11, 28049 Madrid, Spain

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2025, 15(9), 4842; https://doi.org/10.3390/app15094842 (registering DOI)

Submission received: 28 February 2025 / Revised: 21 April 2025 / Accepted: 23 April 2025 / Published: 27 April 2025

(This article belongs to the Special Issue Applied Audio Interaction)


Abstract:

This paper addresses the challenge of optimizing end-to-end neural diarization systems for conversational telephone speech, focusing on diverse acoustic features beyond traditional Mel-filterbanks. We present a methodological framework for integrating and analyzing different feature types as input to the well-known End-to-End Neural Diarization with Encoder Decoder Attractors (EEND-EDA) model, focusing on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) embeddings and Geneva Minimalistic Acoustic Parameter Sets (GeMAPS). Our approach combines systematic feature analysis with adaptation strategies, including speaker-count restriction and regularization techniques. Moreover, through comprehensive ablation studies of GeMAPS features, we identify optimal acoustic parameters and temporal contexts for diarization tasks, achieving a reduced feature set that maintains performance while decreasing computational complexity. Experiments on the CallHome corpus demonstrate that our optimized ECAPA-TDNN with Mel-filterbank combination reduces Diarization Error Rate by 29% relative to baseline systems. Our evaluation framework extends beyond traditional metrics, revealing that different feature combinations exhibit distinct strengths in specific diarization aspects.

Keywords:

end-to-end neural diarization (EEND); acoustic features; ECAPA-TDNN; GeMAPS

1. Introduction

Speaker diarization is the task of automatically identifying and distinguishing between different speakers in an audio recording by analyzing their voice characteristics. This technology enables numerous practical applications across diverse domains: in media production, it facilitates automatic subtitle generation with speaker identification [1]; in meeting transcription services, it creates navigable records with labeled speaker turns [2]; in forensic analysis, it helps identify speakers in courtroom recordings or surveillance audio [3]; in healthcare, it supports medical transcription with proper attribution of physician and patient speech [4,5]; and in call center analytics, it enables evaluation of agent-customer interactions [6]. Speaker diarization also enhances automatic speech recognition (ASR) systems by providing speaker-specific acoustic models and enables long-form audio content indexing for efficient retrieval of specific speakers’ contributions [7].

The evolution of diarization systems has seen multiple paradigms. Traditional approaches first segment audio into speech and non-speech chunks using Voice Activity Detection (VAD) [10], then extract speaker representations such as i-vectors [11], x-vectors [8], or ECAPA-TDNN embeddings [12], and finally cluster these representations using methods like Agglomerative Hierarchical Clustering [8] or the more sophisticated VBx [9]. These speaker representations, while effective, are not inherently trained to handle silence segments and overlapped speech [13], necessitating careful consideration of VAD in the diarization pipeline. These methods also face trade-offs in the choice of embedding extraction window size [14].

The field has witnessed significant advancement with the introduction of end-to-end neural approaches [15,16], which directly address the limitations of traditional methods. End-to-end approaches, particularly the Self-Attentive End-to-End Neural Diarization (SA-EEND) [17,18], effectively handle overlapped speech and silence segments by treating diarization as a multi-label classification problem. The introduction of EEND-EDA (End-to-End Neural Diarization with Encoder-Decoder based Attractors) [19] further enhanced these systems’ ability to handle varying numbers of speakers. We selected the EEND architecture for our investigation due to its established reliability, reproducibility, and adaptability to different feature representations. These models have formed a strong foundation for diarization research, with numerous subsequent advancements building upon their framework. Additionally, the availability of open-source implementations and synthetic data generation algorithms makes replication of these experiments more accessible to the research community, facilitating further innovation and validation of results. It is worth noting that while we use the term ’end-to-end’ to describe these models, our application incorporates pre-trained features like ECAPA-TDNN embeddings, which strictly speaking introduces a separate processing stage. In our context, ’end-to-end’ refers specifically to the direct processing from input features to speaker labels without intermediate clustering or segmentation steps.

Recent research has explored various strategies to improve diarization performance. Some approaches leverage automatic speech recognition for word-level speaker turn detection [20,21], while others propose hybrid systems combining end-to-end approaches with clustering algorithms [22,23]. Additional advances have examined the impact of different speaker representations on diarization performance [24].

In the context of feature combination strategies, the Target-Speaker Voice Activity Detection (TS-VAD) system [25] marked a significant advancement in speaker diarization by demonstrating the effectiveness of incorporating speaker-specific information through feature concatenation. First introduced in the CHiME-6 Challenge [26], where it achieved state-of-the-art performance and led to winning the competition, this approach enhanced speaker diarization by combining traditional Mel-filterbank features with i-vector speaker embeddings. The remarkable success of this feature integration strategy, which significantly outperformed other participating systems, has inspired various extensions and improvements [27,28], suggesting the potential benefits of exploring similar approaches with other types of acoustic features.

This approach of combining different feature sets has proven successful in related speech processing tasks, particularly in paralinguistic applications. Prior research demonstrated the effectiveness of integrating Mel-filterbank features with the Geneva Minimalistic Acoustic Parameter Sets (GeMAPS and eGeMAPS) [29,30] for emotion recognition tasks, achieving improved performance through their complementary characteristics. GeMAPS represent a particularly promising feature set for such integration. These standardized sets of Low-Level Descriptors (LLDs) have demonstrated remarkable effectiveness in emotion recognition and speaker state analysis [31,32], capturing a comprehensive range of voice characteristics including frequency-related, energy-related, spectral, and voice quality parameters. LLDs capture instantaneous acoustic properties of the signal (e.g., F0, energy, formants) at short time frames, whereas High-Level Descriptors aggregate these primary features through statistical functionals (mean, variance, percentiles, etc.) to represent broader temporal patterns and paralinguistic characteristics. The proven capability of GeMAPS to discriminate speaker-specific traits suggests potential benefits for speaker diarization applications [13].

Building upon our previous work [33], this study presents a more comprehensive analysis of current state-of-the-art speaker diarization model performance, with a specific focus on conversational telephone speech. The contributions of this work are summarized as follows:

A systematic investigation of GeMAPS feature sets in end-to-end diarization models, including comprehensive ablation studies to identify optimal acoustic parameters and their temporal contexts. This analysis provides the first detailed guidelines for effectively integrating these paralinguistic features into neural diarization frameworks.
A comprehensive investigation of EEND-EDA’s adaptability to different acoustic representations, demonstrating that careful hyperparameter optimization enables effective integration of both ECAPA-TDNN embeddings and GeMAPS features alongside traditional Mel-filterbank features, achieving significant performance improvements over conventional single-feature approaches.
Proposed adaptation strategies focusing on speaker-count restriction and enhanced regularization techniques, revealing that targeted optimization can yield substantial improvements in diarization performance across different feature types. Our approach achieves a 29% relative reduction in Diarization Error Rate through systematic hyperparameter tuning.
A detailed analysis of temporal resolution’s impact on diarization performance, challenging conventional subsampling approaches and demonstrating the importance of fine-grained temporal analysis for different feature combinations.
A multi-metric evaluation framework that extends beyond traditional Diarization Error Rate (DER), providing deeper insights into system behavior across speaker counting, turn detection, and overlap speech handling capabilities.

Evaluating on the CallHome corpus [34], a widely-used benchmark for conversational telephone speech, our systematic analysis of different feature combinations reveals their distinct strengths: ECAPA-TDNN with Mel-filterbank features excel in overlap speech detection and speaker transition identification, particularly important in natural dialogue, while GeMAPS-based configurations show superior performance in speaker detection precision despite channel variability. The complementary nature of these approaches, with their decorrelated error patterns and distinct performance profiles across different metrics, strongly motivates future work on system fusion strategies that could leverage the strengths of each feature representation. These findings not only advance our understanding of feature integration in neural diarization but also provide practical guidelines for feature selection in telephone speech applications.

The remainder of this paper is organized as follows: Section 2 details our methodological framework for feature processing and adaptation. Section 3 describes the experimental setup and evaluation criteria. Section 4 presents our results and analysis, and Section 5 concludes with a discussion of implications and future directions.

2. Methodology

2.1. Model Architecture

The EEND-EDA (End-to-End Neural Diarization with Encoder-Decoder based Attractors) architecture introduces a flexible framework for speaker diarization that supports two distinct processing paths, as illustrated in Figure 1. The system can operate either with traditional acoustic features alone or with an enhanced representation that combines multiple feature types [33]. The architecture comprises three fundamental components:

A Self-Attention Encoder (SA-EEND) that leverages transformer blocks to capture long-term temporal dependencies and speaker-discriminative patterns in the input features. The encoder generates a sequence of high-dimensional embeddings (e) that encode both acoustic and speaker characteristics.
An Encoder-Decoder Attractors module (EDA) that dynamically determines both the number of speakers and their unique representations in a unified process. This component processes encoder embeddings through an LSTM-based architecture to generate speaker-specific attractors ($a_{s}$). For each potential attractor, it computes an existence probability and activates only those that exceed a learned threshold. The number of speakers is implicitly determined by counting activated attractors, eliminating the need for a separate classification step.
A dot product operation between each embedding and each attractor that works as a selection mechanism, suppressing inactive speakers by producing near-zero values when the embedding and attractor are orthogonal. The resulting values are then processed through a sigmoid activation and thresholded to produce binary speaker activity decisions.
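
The final step of the architecture can be illustrated with a minimal sketch. The function names, the toy 2-dimensional vectors, and the plain-list representation are assumptions made for illustration only; the model itself operates on learned high-dimensional embeddings and attractors.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def speaker_activities(embeddings, attractors, threshold=0.5):
    """Return a T x S matrix of 0/1 speaker activity decisions.

    embeddings: T frame embeddings, each a list of D floats.
    attractors: S speaker attractors, each a list of D floats.
    """
    decisions = []
    for frame in embeddings:
        row = []
        for attractor in attractors:
            # Dot product between the frame embedding and the attractor.
            score = sum(e * a for e, a in zip(frame, attractor))
            row.append(1 if sigmoid(score) > threshold else 0)
        decisions.append(row)
    return decisions


# Toy example (illustrative values, not taken from the experiments).
toy_embeddings = [[4.0, 0.0], [0.0, 4.0], [3.0, 3.0], [-3.0, -3.0]]
toy_attractors = [[1.0, 0.0], [0.0, 1.0]]
toy_decisions = speaker_activities(toy_embeddings, toy_attractors)
print(toy_decisions)  # [[1, 0], [0, 1], [1, 1], [0, 0]]
```

In this toy case the third frame is assigned to both speakers (overlapped speech) and the fourth to neither (silence), which is the multi-label behavior the architecture relies on.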

2.2. Feature Processing Framework

The framework encompasses three categories of acoustic representations, each designed to capture different aspects of the speech signal:

MFB—Traditional Features: The foundation of our feature framework relies on Mel-filterbank coefficients, which provide a perceptually-aligned representation of the speech spectrum. These features serve as our baseline and capture fundamental acoustic properties essential for speaker discrimination.
Speaker Embeddings: Speaker representations derived from the ECAPA-TDNN [12] architecture provide specialized speaker-discriminative information. These embeddings are designed to capture speaker characteristics independently of phonetic content, complementing the temporal acoustic features.
Paralinguistic Features: GeMAPS and eGeMAPS [30] introduce a theoretically-motivated selection of acoustic parameters. These features encompass prosodic, spectral, and voice quality characteristics that may enhance speaker differentiation beyond traditional acoustic representations.

Feature Combination Strategy

Our framework implements feature combination through vector concatenation, formally expressed as:

$F_{\mathrm{combined}} = F_{\mathrm{MFB}} \oplus F_{\mathrm{auxiliary}}$

(1)

where ⊕ denotes the concatenation operation that preserves temporal alignment between feature streams. This approach enables the model to leverage complementary information while maintaining the temporal structure necessary for diarization.
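
A minimal sketch of this frame-wise concatenation is given below. The function name and the toy dimensions are hypothetical, and the only assumption about the data is that both streams have already been brought to the same frame rate.

```python
def concatenate_features(mfb, auxiliary):
    """Concatenate two feature streams frame by frame.

    mfb:       T frames of MFB features (list of lists).
    auxiliary: T frames of auxiliary features (list of lists).
    Both streams must contain the same number of frames so that
    frame t of one stream is paired with frame t of the other.
    """
    if len(mfb) != len(auxiliary):
        raise ValueError("feature streams must have the same number of frames")
    return [list(m) + list(a) for m, a in zip(mfb, auxiliary)]


# Toy example: 3 frames, 2-dim "MFB" and 3-dim "auxiliary" features.
toy_mfb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_aux = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]
toy_combined = concatenate_features(toy_mfb, toy_aux)
print(len(toy_combined), len(toy_combined[0]))  # 3 5
```

The number of frames is unchanged and only the per-frame dimensionality grows, which is what keeps the combined input compatible with frame-level diarization labels.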

2.3. Adaptation Methodology

The adaptation phase bridges the gap between our simulated training conversations and real-world telephone speech. Although our training data incorporates noise and reverberation augmentations, real CallHome conversations present unique challenges including spontaneous speaking patterns and telephone channel effects. Our methodology employs targeted strategies to address these domain differences:

2.3.1. Domain-Focused Adaptation

We propose a targeted adaptation approach that addresses a key domain mismatch problem in end-to-end speaker diarization models. Specifically, our initial model was trained on simulated conversations with up to 4 speakers, while the target CallHome corpus contains conversations with up to 7 speakers but with limited adaptation data available. This mismatch presents a fundamental adaptation challenge: should we expose the model to all available target domain data regardless of speaker count distribution, or should we restrict adaptation to maintain consistency with training conditions? Our methodology explores two strategic approaches:

Traditional adaptation using the complete CallHome adaptation set (CH1 all), which includes conversations with 2–7 speakers. This approach potentially extends the model’s capabilities to handle higher speaker counts but risks degrading performance on the more common 2–4 speaker scenarios due to the limited examples of higher speaker counts in the adaptation set.
Restricted adaptation using only conversations with 2–4 speakers (CH1 2–4spk), which maintains alignment with the original training distribution. This approach prioritizes robust performance within the speaker range the model was originally trained on, potentially sacrificing adaptability to higher speaker counts.

2.3.2. Regularization Methods

Our framework incorporates four key components designed to improve adaptation robustness:

Dropout Strategy: We introduce a systematic approach to dropout regularization during adaptation, exploring how increased network stochasticity affects the model’s ability to generalize from limited adaptation data.
Class-Balanced Learning: To address the uneven distribution of conversations across different speaker counts in the CallHome corpus, we implement a weighted sampling strategy during adaptation. This approach compensates for the natural imbalance where conversations with 2–4 speakers are significantly more frequent than those with 5–7 speakers. For example, only 10% of the conversations in the corpus contain 6–7 speakers, while over 60% feature 2–3 speakers. Our weighted sampling technique gives proportionally higher importance to underrepresented speaker counts during the adaptation process, ensuring the model receives sufficient exposure to all conversation types despite their imbalanced representation in the dataset.
Subsampling Rate Adjustment: We address conventional temporal resolution constraints through reduced subsampling during adaptation. Typically, EEND models apply a subsampling factor of 10 to the input feature sequence due to computational constraints. Our approach aims to preserve finer temporal details that may be crucial for accurate speaker transition detection.
Label Refinement: Manually labeled diarization data may contain imprecise annotations. To mitigate this issue, we implement a Gaussian kernel-based smoothing methodology. For each speaker $s$, we define the original binary label sequence $y_{s} = [y_{s, 1}; \dots; y_{s, T}]$ and apply a localized smoothing operation:

${\tilde{y}}_{s} = y_{s} * G$

(2)

where ∗ denotes convolution and $G$ represents a three-frame Gaussian kernel:

$G(t) = e^{-\frac{t^{2}}{2\sigma^{2}}}, \quad t \in \{-1, 0, 1\}$

(3)

This operation creates smooth transitions at speaker change points while preserving label integrity in stable regions.
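
The smoothing in Equations (2) and (3) can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: the kernel is normalized to sum to one (so that stable regions keep their original value), and the sequence edges are padded by repeating the first and last labels.

```python
import math


def gaussian_kernel(sigma):
    """Three-tap Gaussian kernel over t in {-1, 0, 1}, normalized to sum to 1.

    The normalization is an assumption of this sketch.
    """
    weights = [math.exp(-(t * t) / (2.0 * sigma * sigma)) for t in (-1, 0, 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_labels(labels, sigma=2.0):
    """Convolve one speaker's binary label sequence with the kernel."""
    if not labels:
        return []
    kernel = gaussian_kernel(sigma)
    # Edge handling (assumption): repeat the first and last labels.
    padded = [labels[0]] + list(labels) + [labels[-1]]
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(labels))
    ]


# A speaker who starts talking at the third frame.
toy_labels = [0, 0, 1, 1, 1]
toy_smoothed = smooth_labels(toy_labels, sigma=2.0)
print([round(v, 3) for v in toy_smoothed])
```

With these assumptions, only the two frames adjacent to the speaker change take intermediate values, while frames inside stable regions keep their labels of 0 or 1.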

2.3.3. Combination of Strategies

Our methodology culminates in a unified approach that combines these adaptation components. This integration enables the examination of potential synergies between different adaptation strategies and their interaction with various feature representations.

3. Experimental Setup

3.1. Implementation Details

3.1.1. Feature Extraction Configuration

The baseline Mel-filterbank features consist of 23-dimensional coefficients extracted using a 25 ms window with a 10 ms frame shift. Following standard practice in EEND architectures, these features are contextualized through a stacking operation that combines 15 consecutive frames (7 frames on each side), resulting in a 345-dimensional vector sampled every 100 ms after applying a subsampling factor of 10.
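
The stacking and subsampling described above can be sketched as follows. The function name is hypothetical, and zero-padding at the sequence edges is an assumption of this sketch, since the edge handling is not specified here.

```python
def stack_and_subsample(frames, context=7, subsampling=10):
    """Stack each kept frame with `context` neighbors on each side.

    frames: T frames of D-dimensional features (list of lists).
    Returns ceil(T / subsampling) frames of D * (2 * context + 1) values.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    output = []
    for t in range(0, num_frames, subsampling):
        stacked = []
        for offset in range(-context, context + 1):
            index = t + offset
            if 0 <= index < num_frames:
                stacked.extend(frames[index])
            else:
                # Assumption: pad with zeros outside the recording.
                stacked.extend([0.0] * dim)
        output.append(stacked)
    return output


# 100 frames at a 10 ms shift (1 s of audio), 23 coefficients each.
toy_frames = [[float(t)] * 23 for t in range(100)]
toy_stacked = stack_and_subsample(toy_frames, context=7, subsampling=10)
print(len(toy_stacked), len(toy_stacked[0]))  # 10 345
```

One second of audio therefore yields 10 input vectors of 345 dimensions, one every 100 ms, matching the configuration described above.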

For speaker embeddings, we extract 512-dimensional x-vectors using the ECAPA-TDNN architecture with a 1 s window and 100 ms shift. This configuration, established through prior experimentation [33], balances temporal resolution against speaker-discriminative power. Because ECAPA-TDNN embeddings are limited in representing non-speech regions, as also explored in [33], embeddings in silence segments are replaced with zero vectors during both training and inference. These segments are identified with Oracle VAD, which uses the reference labels to locate non-speech regions precisely. Oracle VAD is used in these experiments for controlled comparison; as demonstrated in our previous work, a practical implementation could employ an external VAD system (such as Kaldi’s VAD [35]) during inference to perform this replacement, with final performance depending on the accuracy of that detection system. We do not stack these embeddings, as each individual x-vector already incorporates sufficient temporal context and their high dimensionality makes additional stacking unnecessary. When concatenated with the 345-dimensional MFB features, this produces a final feature vector of 857 dimensions (512 + 345).
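
The zero-vector replacement can be sketched as below. The function name and the toy 3-dimensional embeddings are hypothetical; the speech/non-speech flags stand in for the output of Oracle VAD or of any external VAD.

```python
def zero_non_speech(embeddings, is_speech):
    """Replace embeddings of non-speech frames with zero vectors.

    embeddings: T embeddings (list of lists of floats).
    is_speech:  T booleans, True where the VAD marks speech.
    """
    if len(embeddings) != len(is_speech):
        raise ValueError("one VAD decision is required per embedding")
    return [
        list(emb) if speech else [0.0] * len(emb)
        for emb, speech in zip(embeddings, is_speech)
    ]


# Toy example with 3-dimensional embeddings instead of 512.
toy_embeddings_vad = [[0.2, -0.1, 0.7], [0.5, 0.4, -0.3], [0.9, 0.1, 0.0]]
toy_vad = [True, False, True]
toy_masked = zero_non_speech(toy_embeddings_vad, toy_vad)
print(toy_masked)  # [[0.2, -0.1, 0.7], [0.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
```

Because the replacement keeps one vector per 100 ms frame, the masked embeddings remain aligned with the MFB stream for concatenation.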

The paralinguistic parameter sets, GeMAPS and eGeMAPS, are extracted using a 60 ms analysis window with 10 ms shift. We did not select a smaller window size due to inherent limitations in the minimum window size required for these feature sets, with 60 ms being the minimum possible. The base GeMAPS set produces 62 features encompassing frequency, energy, spectral, and voice quality parameters, while eGeMAPS extends this to 88 features with additional voice quality and spectral characteristics. When combining features through concatenation, temporal alignment is maintained through consistent 100 ms frame intervals. The resulting feature vector dimension depends on the context size, following the formula:

$\mathrm{Vector}_{\mathrm{size}} = \mathrm{GeMAPS}_{\mathrm{dimension}} \times (2 \times \mathrm{context} + 1)$

For example, with our proposed reduced set of 52 GeMAPS features presented in Section 4.1.1 and a context size of 2, the resulting vector has 260 dimensions (52 × (2 × 2 + 1)). When concatenated with 345-dimensional MFB features, this produces a final feature vector of 605 dimensions. In the results section, we present an analysis for two-speaker scenarios examining how different stacking contexts affect these feature types. Additionally, we explore the use of Oracle VAD during both training and inference for two-speaker cases. However, as our results demonstrate, GeMAPS already contain sufficient information about non-speech segments, making Oracle VAD unnecessary for the multi-speaker experiments.
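
The dimensionality arithmetic used throughout this section can be checked with a short helper. The function name is hypothetical; the numbers are those quoted in the text.

```python
def stacked_dimension(feature_dim, context):
    """Size of a feature vector after stacking `context` frames per side."""
    return feature_dim * (2 * context + 1)


MFB_STACKED = stacked_dimension(23, 7)       # baseline MFB input
GEMAPS_REDUCED = stacked_dimension(52, 2)    # reduced GeMAPS set, context 2
GEMAPS_FULL = stacked_dimension(62, 2)       # complete GeMAPS set, context 2

print(MFB_STACKED)                   # 345
print(GEMAPS_REDUCED)                # 260
print(GEMAPS_REDUCED + MFB_STACKED)  # 605
print(GEMAPS_FULL)                   # 310
print(512 + MFB_STACKED)             # 857 (ECAPA-TDNN embedding plus MFB)
```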

3.1.2. Model Configuration

The SA-EEND encoder is composed of four transformer blocks, each containing four attention heads, producing 256-dimensional embeddings. The EDA module maintains a pool of 15 potential attractors during training, though typically only a subset reaches the existence probability threshold required for activation. The diarization layer employs a standard sigmoid activation function with a detection threshold of 0.5.

3.2. Training and Adaptation Protocol

3.2.1. Dataset Specifications

The training stage utilizes Simulated Conversations (SC) constructed following the generation algorithm described in [36], with a controlled speech overlap ratio of 34.4% as used also in [36]. The initial SC 2spk dataset, comprising 2480 h of two-speaker conversations, was generated using recordings from Switchboard-2 (Phases I, II, and III), Switchboard Cellular (Part1 and Part2) [37], and NIST Speaker Recognition Evaluation (2004, 2005, 2006, and 2008) corpora [38,39,40,41], all standardized to 8 kHz sampling rate. Following [18], the training data was augmented with background noise from the MUSAN dataset [42], and each utterance had a 50% probability of being convolved with a randomly chosen simulated room impulse response from the RIR dataset [43].

For multi-speaker scenarios, we expanded this approach to create the SC 1–4spk dataset, which provides approximately 2480 h per speaker configuration (one to four speakers), totaling approximately 10,000 h of audio while maintaining similar augmentation strategies.

The evaluation employs the CallHome (https://catalog.ldc.upenn.edu/LDC2001S97) dataset, with one subset (CH1) used for adaptation and the other (CH2) for testing, both defined in the Kaldi [35] recipe for CallHome (https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh). As shown in Table 1, the CallHome database presents a notable class imbalance, with two-speaker conversations comprising approximately 62% of the adaptation set (CH1) and 59% of the testing set (CH2). Figure 2 illustrates the distribution of audio duration in hours across different numbers of speakers for both subsets, providing further insight into this imbalance.

3.2.2. Training Configuration

Initial training on SC 2spk employs the Noam optimizer with 200,000 warm-up steps. The multi-speaker adaptation phase uses the Adam optimizer with a learning rate of 10⁻⁵ over 100 epochs. For domain adaptation experiments, we maintain the same optimizer configuration while exploring the following parameters and configurations:

Dropout rates: 0.1 (baseline), 0.3, 0.5.
Weighted sampling inversely proportional to speaker count frequency.
Subsampling factors: 10 (baseline), 5.
Gaussian smoothing kernel with $\sigma = 2$ and 3-frame window.
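
The weighted sampling listed above can be sketched as follows. This is one plausible reading of "inversely proportional to speaker count frequency", not a description of the exact sampler used in the experiments; the function names and the toy speaker counts are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(speaker_counts):
    """One weight per conversation: 1 / (number of conversations that
    have the same speaker count)."""
    frequency = Counter(speaker_counts)
    return [1.0 / frequency[count] for count in speaker_counts]


def sample_conversations(speaker_counts, num_samples, seed=0):
    """Draw conversation indices with replacement using those weights."""
    weights = inverse_frequency_weights(speaker_counts)
    rng = random.Random(seed)
    indices = range(len(speaker_counts))
    return rng.choices(indices, weights=weights, k=num_samples)


# Toy adaptation set: speaker count of each conversation.
toy_counts = [2, 2, 2, 2, 3, 3, 4, 6]
toy_weights = inverse_frequency_weights(toy_counts)
print(toy_weights)  # [0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 1.0, 1.0]
toy_batch = sample_conversations(toy_counts, num_samples=16)
```

Under this scheme the weights of each speaker-count class sum to one, so every class is equally likely to be drawn even though two-speaker conversations dominate the toy set.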

3.3. Evaluation Framework

While speaker diarization systems are traditionally evaluated solely using Diarization Error Rate (DER), this single metric may not fully capture the nuanced performance aspects of modern diarization systems, particularly in challenging scenarios involving overlapped speech and multiple speakers. Therefore, we use a more complete set of metrics that examines system performance across multiple complementary dimensions. The software used for estimating the different metrics can be found in the BUT Speech@FIT repository (https://github.com/BUTSpeechFIT/diarization-utils).

3.3.1. Diarization Error Rate

The DER remains the primary metric for overall system performance evaluation and is computed as:

$\mathrm{DER} = \frac{\text{Miss} + \text{FA} + \text{Speaker Error}}{\text{Total Speech Duration}}$

(4)

where Miss represents missed speech detection, FA indicates false alarm speech detection, and Speaker Error quantifies incorrect speaker attribution, all measured in seconds of audio. While this metric provides a general assessment of system performance, it may obscure specific areas of strength or weakness in the diarization process. Following standard practice, a 250 ms collar is typically applied around speaker boundaries to accommodate annotation inconsistencies. We report our final results both with and without this collar to provide a more comprehensive performance assessment.
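
Equation (4) can be illustrated with a short sketch. The function name and the durations are hypothetical and chosen only to make the arithmetic easy to follow; collar handling and the optimal speaker mapping performed by scoring tools are outside the scope of this sketch.

```python
def diarization_error_rate(miss, false_alarm, speaker_error, total_speech):
    """DER as the summed error durations over the total speech duration.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total speech duration must be positive")
    return (miss + false_alarm + speaker_error) / total_speech


# Hypothetical recording with 600 s of reference speech.
example_der = diarization_error_rate(
    miss=18.0, false_alarm=12.0, speaker_error=6.0, total_speech=600.0
)
print(example_der)  # 0.06, i.e., a DER of 6%
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% for systems that hallucinate large amounts of speech.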

3.3.2. Overlapped Speech Detection Performance

Given the significant challenge that overlapped speech poses in real-world conversations, we specifically evaluate the system’s capability to detect and handle such segments. This evaluation employs standard classification metrics: precision (P), recall (R), F1-score, and accuracy. These metrics provide insights into both the system’s ability to identify overlapped speech segments and its potential bias toward single-speaker detection.
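
As a minimal sketch, these metrics can be computed at the frame level from binary speaker activity matrices, treating a frame as overlapped when two or more speakers are active. The frame-level formulation and the function names are assumptions of this sketch; the scoring details of the evaluation software may differ.

```python
def overlap_flags(activity):
    """True for frames in which at least two speakers are active."""
    return [sum(frame) >= 2 for frame in activity]


def overlap_detection_metrics(reference, hypothesis):
    """Precision, recall, F1 and accuracy for overlapped speech detection.

    reference, hypothesis: T x S matrices of 0/1 speaker activities.
    """
    ref = overlap_flags(reference)
    hyp = overlap_flags(hypothesis)
    tp = sum(r and h for r, h in zip(ref, hyp))
    fp = sum((not r) and h for r, h in zip(ref, hyp))
    fn = sum(r and (not h) for r, h in zip(ref, hyp))
    tn = sum((not r) and (not h) for r, h in zip(ref, hyp))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    accuracy = (tp + tn) / len(ref)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy example with two speakers and six frames.
toy_reference = [[1, 0], [1, 1], [1, 1], [0, 1], [1, 1], [0, 0]]
toy_hypothesis = [[1, 0], [1, 1], [1, 0], [0, 1], [1, 1], [1, 1]]
toy_overlap = overlap_detection_metrics(toy_reference, toy_hypothesis)
print(toy_overlap)
```

In the toy example there are two true positives, one false positive, one false negative, and two true negatives, so precision, recall, F1, and accuracy all equal 2/3.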

3.3.3. Speaker Count Estimation

Accurate speaker count estimation is crucial for downstream applications (such as meeting transcription systems or automated call center analytics) and system reliability. We evaluate this aspect using two complementary metrics:

Speaker Counting Error: Quantifies the absolute difference between predicted and actual speaker counts.
Speaker Count Accuracy: Measures how often the predicted speaker count exactly matches the true number of speakers.

This dual approach helps understand both the magnitude and frequency of speaker counting errors.
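
Both metrics can be sketched in a few lines. Averaging the absolute error over recordings is an assumption of this sketch, as are the function name and the toy counts.

```python
def speaker_count_metrics(predicted, actual):
    """Return (mean absolute counting error, exact-match accuracy).

    predicted, actual: speaker counts per recording, in the same order.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need one prediction per non-empty reference list")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    mean_error = sum(errors) / len(errors)
    accuracy = sum(1 for e in errors if e == 0) / len(errors)
    return mean_error, accuracy


# Toy example with four recordings.
toy_mean_error, toy_accuracy = speaker_count_metrics(
    predicted=[2, 3, 4, 2], actual=[2, 3, 5, 3]
)
print(toy_mean_error, toy_accuracy)  # 0.5 0.5
```

Two systems with the same accuracy can differ in counting error: a system that misses by one speaker is penalized less by the first metric than one that misses by three, while the second metric treats both as equally wrong.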

3.3.4. Speaker Turn Detection

The ability to accurately detect speaker transitions represents a fundamental aspect of diarization quality that might be masked in the overall DER. We evaluate turn detection performance using precision (P), recall (R), and F1-score, providing detailed insight into the system’s capability to identify speaker changes accurately. These metrics are particularly relevant when analyzing the impact of different feature sets and adaptation strategies on temporal resolution and speaker transition detection.

This comprehensive evaluation framework enables a more nuanced understanding of system performance across different aspects of the diarization task. By examining these complementary metrics, we can better assess the relative strengths and weaknesses of different feature configurations and adaptation strategies, providing insights that might be obscured by relying solely on DER.

4. Results and Analysis

4.1. Two-Speaker Scenario Performance

Figure 3 presents the diarization performance for different feature configurations on the two-speaker conversations of the CallHome CH2 subset (CH2-2spk). The evaluation encompasses our proposed feature types (ECAPA-TDNN, GeMAPS, and eGeMAPS) compared against the baseline MFB features, both as standalone inputs and in concatenated configurations.

The baseline MFB features achieve a competitive DER of 10.09% without adaptation, which improves to 8.07% when domain adaptation is applied, without requiring VAD processing. When evaluating individual non-adapted features without MFB concatenation, GeMAPS features demonstrate the strongest standalone performance with a DER of 15.0%, compared to 23.5% for ECAPA-TDNN with Oracle VAD and 29.6% for eGeMAPS features. These results indicate that individual features, despite their specialized design, do not outperform the traditional MFB approach on their own.

Feature concatenation with MFB yields substantial improvements across all configurations. Without adaptation, ECAPA-TDNN ⊕ MFB yields a DER of 9.9%, while GeMAPS ⊕ MFB achieves 13.4% and eGeMAPS ⊕ MFB shows 24.7%. With adaptation, the ECAPA-TDNN ⊕ MFB combination achieves the best performance with a DER of 7.2% under Oracle VAD conditions, representing a 10.8% relative improvement over the adapted MFB baseline (8.07%). Similarly, adapted GeMAPS ⊕ MFB achieves a DER of 9.9%, while adapted eGeMAPS ⊕ MFB reduces the error to 9.1%.
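
As a check on the relative improvement quoted above, taking the adapted MFB system (8.07% DER) as the reference and the adapted ECAPA-TDNN ⊕ MFB system (7.2% DER) as the proposed configuration gives:

$\frac{8.07 - 7.20}{8.07} \approx 0.108 = 10.8\%$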

The impact of Oracle VAD varies significantly across feature configurations. For ECAPA-TDNN ⊕ MFB, the absence of Oracle VAD leads to performance degradation, with DER increasing from 7.2% to 12.6%, which aligns with the known limitations of ECAPA embeddings in representing non-speech segments. However, GeMAPS ⊕ MFB shows an interesting pattern: when combined with domain adaptation, it achieves better performance without Oracle VAD (9.0% DER) than with it (13.4% DER). This suggests that GeMAPS features might inherently capture speech activity information more effectively, making the system more robust when explicit VAD is not available.

Table 2 provides a detailed breakdown of the DER components for the best feature configurations. These optimal configurations will be utilized in subsequent multi-speaker experiments. The adapted ECAPA-TDNN ⊕ MFB system achieves the lowest overall DER (7.20%), with balanced errors across miss detection (3.0%), false alarm (3.05%), and speaker error (1.15%) components. Notably, the adapted GeMAPS ⊕ MFB system demonstrates the lowest false alarm rate (2.40%) and speaker error (0.90%), suggesting that these features excel at precisely identifying speech boundaries and distinguishing between speakers, particularly where background noise is present, despite having a higher miss rate (5.74%) than other configurations.

4.1.1. GeMAPS Analysis

Our analysis of GeMAPS features consists of two complementary studies: a feature importance assessment through ablation and an investigation of temporal context impact. For these experiments, we used a context size of 2 frames for the initial feature importance evaluation, resulting in a 310-dimensional input vector (5 × 62 features), which is comparable to the baseline MFB configuration (345 features) in terms of dimensionality.

Feature Group Importance (Inference Stage)

Figure 4 shows the DER degradation (ΔDER%) when different feature groups are perturbed during inference. In this context, “perturbed during inference” refers to averaging each parameter in the corresponding feature group across all frames in the utterance, effectively neutralizing their discriminative information while maintaining the overall structure. The formant-related features (14 parameters) show the highest impact, with a 27.28% DER increase when masked, followed by fundamental frequency features (F0, 10 parameters) with an 18.85% degradation. Frequency ratios and loudness parameters also demonstrate substantial importance, causing 15.42% and 15.24% DER increases respectively. In contrast, jitter/shimmer and voiced segment features show minimal impact (0.2% DER change), suggesting their limited contribution to speaker discrimination in our diarization framework. For more details about the fundamentals of the GeMAPS features, see [30].
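
The perturbation used in this ablation can be sketched as below. The function name, the column indices, and the toy values are hypothetical; in practice the indices would be those of the parameters belonging to one GeMAPS feature group.

```python
def neutralize_feature_group(features, group_indices):
    """Replace the selected feature columns by their utterance-level mean.

    features:      T frames of D features (list of lists).
    group_indices: column indices of the feature group to neutralize.
    """
    num_frames = len(features)
    means = {
        j: sum(frame[j] for frame in features) / num_frames
        for j in group_indices
    }
    return [
        [means[j] if j in means else value for j, value in enumerate(frame)]
        for frame in features
    ]


# Toy utterance: 3 frames, 3 features; neutralize the middle column.
toy_utterance = [[1.0, 10.0, 100.0], [3.0, 20.0, 200.0], [5.0, 30.0, 300.0]]
toy_perturbed = neutralize_feature_group(toy_utterance, group_indices=[1])
print(toy_perturbed)
# [[1.0, 20.0, 100.0], [3.0, 20.0, 200.0], [5.0, 20.0, 300.0]]
```

The perturbed input keeps its shape and its overall level, so the trained model can be evaluated without modification; only the frame-to-frame variation of the selected group is removed.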

Feature Set Optimization (Training Stage)

Based on the previous results, three EEND-EDA models were trained using SC 2spk with different GeMAPS configurations as proposed in Table 3. While intuition might suggest that fundamental frequency and formant features (24 parameters) would be sufficient for speaker discrimination, this reduced configuration significantly degrades performance, increasing DER to 45.33%. This substantial degradation can be attributed to the limited spectral information available in the 8 kHz telephone data used in the two-speaker scenario. At this sampling rate, the frequency range is constrained to 4 kHz, which severely restricts the discriminative power of formants, particularly for higher formants that often contain speaker-specific characteristics. Additionally, in telephone conversations, the standard telephone bandwidth filtering (approximately 300–4000 Hz) removes critical acoustic information, including higher formants and spectral details that would normally help distinguish between speakers. This bandwidth limitation, combined with channel effects and compression artifacts, significantly compromises fundamental frequency estimation accuracy, further reducing the effectiveness of these features when used in isolation.

However, our analysis enabled the identification of a reduced feature set that maintains performance while eliminating less contributive features. This optimized set consists of 52 parameters, including formants (14 features), fundamental frequency (10 features), frequency ratios (12 features), loudness parameters (10 features), and harmonics-to-noise ratio (6 features), while excluding voiced segments and jitter/shimmer features. This configuration achieves a DER of 13.93% compared to 13.60% with the complete set, demonstrating that we can maintain diarization performance while reducing computational complexity through informed feature selection.
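The group sizes quoted above can be turned into a simple column selection. In this sketch the column order is hypothetical, and the size of the excluded block (10 parameters) is inferred as 62 − 52 rather than stated in the text; the two excluded groups are lumped together because their individual sizes are not given here.

```python
import numpy as np

# Group sizes quoted in the text. The last entry is inferred (62 - 52) and
# merges the two excluded groups; the column order is hypothetical.
GROUP_SIZES = {
    "formants": 14,
    "f0": 10,
    "frequency_ratios": 12,
    "loudness": 10,
    "hnr": 6,
    "jitter_shimmer_and_voiced_segments": 10,
}
KEPT_GROUPS = ["formants", "f0", "frequency_ratios", "loudness", "hnr"]


def group_columns(group_sizes: dict) -> dict:
    """Map each group name to a contiguous block of column indices."""
    columns, start = {}, 0
    for name, size in group_sizes.items():
        columns[name] = list(range(start, start + size))
        start += size
    return columns


def select_groups(features: np.ndarray, kept_groups: list) -> np.ndarray:
    """Keep only the columns that belong to the requested groups."""
    columns = group_columns(GROUP_SIZES)
    kept = [index for name in kept_groups for index in columns[name]]
    return features[:, kept]


full = np.zeros((100, 62))
reduced = select_groups(full, KEPT_GROUPS)
f0_formants = select_groups(full, ["f0", "formants"])
print(reduced.shape, f0_formants.shape)
```

The two selections correspond to the 52-parameter reduced set and the 24-parameter F0 + formants configuration discussed in this subsection.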

Impact of Temporal Context

Figure 5 illustrates the results of multiple models trained for these three feature configurations using different context sizes. The complete GeMAPS and our proposed reduced set maintain stable performance across different context windows, with DER variations of less than 0.5% absolute. In contrast, the F0 + Formants configuration shows high instability and consistently poor performance regardless of the context size. This stability analysis supports our feature selection approach, demonstrating that the reduced feature set maintains robustness across different temporal scales while achieving computational efficiency through dimensionality reduction and reduced temporal context requirements.

4.2. Multi-Speaker Scenario

This section presents our key findings and analysis for multi-speaker diarization scenarios. For detailed experimental results, including full performance metrics across all feature configurations and evaluation criteria, we refer readers to Tables A1–A9 in Appendix A, which provide a complete breakdown by feature type (MFB, ECAPA-TDNN ⊕ MFB, and GeMAPS ⊕ MFB).

4.2.1. Baseline Comparison

To establish a solid foundation for our feature analysis in multi-speaker scenarios, we first replicated the baseline EEND-EDA system following the three-stage training protocol described in Section 3.2. Table 4 presents a detailed comparison between our implementation and the original baseline results reported in [36].

While our implementation shows slightly higher DER values in most scenarios, the performance trends across different numbers of speakers and training configurations remain consistent with the original baseline. The differences in absolute performance can be attributed to variations in the simulated training data generation, despite following the same methodology. Both implementations demonstrate similar patterns: performance degradation as the number of speakers increases, and substantial improvements when incorporating multi-speaker training data (SC 1–4 spkr) and real conversational data adaptation (CH1 all).

To provide a more comprehensive assessment of our baseline system’s capabilities, we evaluated its performance across multiple metrics beyond DER, as shown in Table 5. In terms of speaker counting, the system achieves an accuracy of 73.20% in correctly identifying the number of speakers, with a mean Speaker Count Error (SCE) of 0.348 speakers. These results indicate a robust capability in estimating the number of participants in a conversation, particularly considering the challenging nature of multi-speaker scenarios.
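For readers who want to reproduce the speaker counting metrics, a minimal sketch is given below. It assumes that SCE is the mean absolute difference between the reference and estimated number of speakers per recording; this definition is an assumption made for illustration and should be checked against the metric definitions given earlier in the paper. The counts are toy values, not CallHome results.

```python
import numpy as np


def speaker_count_metrics(reference: list, estimated: list) -> tuple:
    """Return (accuracy, mean speaker count error) over a set of recordings.

    Accuracy is the fraction of recordings with the exact speaker count.
    SCE is assumed here to be the mean absolute count difference.
    """
    ref = np.asarray(reference)
    est = np.asarray(estimated)
    accuracy = float(np.mean(ref == est))
    sce = float(np.mean(np.abs(ref - est)))
    return accuracy, sce


# Toy example with five recordings.
accuracy, sce = speaker_count_metrics([2, 3, 4, 2, 3], [2, 3, 3, 2, 5])
print(f"accuracy={accuracy:.2f}, SCE={sce:.2f}")
```

Under this definition, an SCE below 0.5 means the estimate is on average off by less than one speaker for every two recordings.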

The speaker turn detection performance reveals some challenges in precisely identifying speaker transitions, with a precision of 62.41% and recall of 46.31%, resulting in an F1-score of 53.16%. For overlapped speech detection (OSD), the system demonstrates a precision of 64.90% and recall of 42.40%, yielding an F1-score of 51.30%. The high OSD accuracy of 87.95% should be interpreted in the context of the natural class imbalance in conversational speech, where non-overlapped segments are more prevalent.
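The F1-scores above are the harmonic mean of precision and recall. The short sketch below recomputes them from the quoted precision and recall values; small deviations in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values quoted in the text, in percent.
turn_f1 = f1_score(62.41, 46.31)
osd_f1 = f1_score(64.90, 42.40)
print(f"turn detection F1 = {turn_f1:.2f}%, OSD F1 = {osd_f1:.2f}%")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the relatively low recall in both tasks dominates the resulting F1-scores.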

The results suggest that while the system performs reasonably well in speaker counting, there is substantial room for improvement in detecting speaker transitions and overlapped speech segments.

4.2.2. Feature Performance in Multi-Speaker Scenarios

We evaluated the performance of our proposed feature combinations: ECAPA-TDNN ⊕ MFB and GeMAPS ⊕ MFB across multi-speaker scenarios. Table 6 presents the DER results for both configurations compared to our baseline implementation.

To provide a more comprehensive analysis of system behavior, we evaluated additional performance metrics for each configuration, as shown in Table 7.

The comprehensive metrics analysis reveals interesting patterns across configurations. The baseline MFB features demonstrate superior performance in several aspects, particularly in speaker counting with the highest accuracy (73.20%) and lowest SCE (0.348). This advantage extends to speaker turn detection, where MFB achieves the best F1-score of 53.16%, outperforming both feature combinations. In terms of overlap speech detection, MFB maintains its edge with the highest F1-score (51.30%) and accuracy (87.95%).

When comparing the feature combinations, GeMAPS ⊕ MFB shows slightly better speaker counting performance than ECAPA-TDNN ⊕ MFB, with a lower SCE (0.428 vs. 0.444) and higher accuracy (66.40% vs. 64.00%). It also demonstrates more balanced performance in speaker turn detection, achieving a higher F1-score of 48.21% compared to 45.32% for ECAPA-TDNN ⊕ MFB. In terms of overlap speech detection, while both combinations achieve the same F1-score of 44.00%, they do so through different precision-recall trade-offs, with GeMAPS ⊕ MFB showing notably higher precision but lower recall compared to ECAPA-TDNN ⊕ MFB.

4.2.3. Domain-Focused Adaptation Analysis

Domain adaptation strategies play a crucial role in optimizing diarization performance. Figure 6 illustrates the comparative impact of adaptation approaches across different feature configurations, specifically contrasting adaptation using the complete CH1 dataset (containing conversations with 2–7 speakers) versus a restricted subset containing only conversations with 2–4 speakers, which more closely resembles the original training conditions.

For the MFB baseline system (Figure 6a), restricting the adaptation set to 2–4 speakers shows consistent improvements across all metrics. The DER decreases from 16.80% to 15.64%, while speaker counting becomes more reliable with SCE improving from 0.348 to 0.320 and accuracy increasing from 73.60% to 74.40%. Both speaker turn detection and overlap speech detection show slight improvements (F1-scores increasing from 53.2% to 54.0% and 51.3% to 52.6% respectively).

The ECAPA-TDNN ⊕ MFB configuration (Figure 6b) demonstrates the most substantial benefits from restricted adaptation. DER significantly improves from 20.95% to 17.09%, a relative improvement of 18.4%. Speaker counting metrics show notable enhancement, with SCE decreasing from 0.444 to 0.376 and accuracy increasing from 64.0% to 68.4%. Turn detection F1-score improves from 45.32% to 48.33%, and OSD performance increases from 44.0% to 46.1% F1-score.

For GeMAPS ⊕ MFB (Figure 6c), the improvements are more modest but still consistent. DER shows a slight improvement from 23.21% to 22.50%, while speaker counting metrics improve marginally (SCE from 0.428 to 0.415, accuracy from 66.4% to 67.3%). However, turn detection performance shows a small degradation (F1-score from 48.21% to 47.09%), though this is compensated by improved OSD performance (F1-score from 44.0% to 45.27%).

These results demonstrate that feature selection significantly influences the effectiveness of domain adaptation in EEND systems. The ECAPA-TDNN embeddings, when concatenated with MFB features, show heightened sensitivity to the distribution of speakers in the adaptation data. This suggests that pre-trained speaker embeddings retain their speaker discriminative properties when used as input features for the EEND framework, making them more responsive to targeted adaptation strategies.

The differential response across feature types indicates a fundamental property of acoustic representations in diarization: features that excel at speaker discrimination (such as ECAPA-TDNN embeddings) benefit more substantially from adaptation data that closely matches the target domain’s speaker distribution. This is likely because these features emphasize speaker-specific characteristics that vary more significantly with the number of participants in a conversation.

From a practical implementation perspective, these findings suggest that adaptation strategies should be customized based on the selected feature representation. While MFB features show consistent but modest improvements with restricted adaptation, systems utilizing speaker embedding features should prioritize adaptation data that closely aligns with the expected deployment conditions, particularly in terms of speaker count. This insight enables more efficient utilization of adaptation resources by focusing on the most relevant subset of available data rather than maximizing adaptation data volume indiscriminately.

4.2.4. Enhanced Adaptation Techniques

Having demonstrated the benefits of restricted adaptation (CH1 2–4), we explore additional techniques to further improve the adaptation process. These enhancements aim to address specific challenges in the adaptation phase: overfitting (through increased dropout), class imbalance (through weighted sampling), and annotation precision (through label smoothing).

Dropout Regularization

The impact of increased dropout rates during adaptation shows distinct patterns across different feature configurations, as illustrated in Figure 7. For the baseline MFB features, as shown in Table 8, moderate dropout (0.3) maintains performance stability while providing minor improvements in speaker turn detection, with the F1-score increasing from 53.16% to 54.13% when using the complete adaptation set. However, higher dropout rates (0.5) lead to significant degradation across all metrics, particularly affecting overlap speech detection where the F1-score decreases dramatically from 51.3% to 30.4%.

The ECAPA-TDNN ⊕ MFB configuration, detailed in Table 9, demonstrates the most notable benefits from increased regularization, especially when combined with restricted adaptation. With a dropout rate of 0.3 and CH1 2–4 spk adaptation, the system achieves its best performance with a DER of 15.8%, representing a 24.6% relative improvement over the baseline configuration (20.95%). This enhancement extends beyond DER, with improved speaker counting accuracy (68.8%) and more robust turn detection (F1-score of 48.89%). The results suggest that this feature combination particularly benefits from the regularizing effect of dropout in preventing overfitting to the small adaptation set.

For GeMAPS ⊕ MFB features, Table 10 reveals that the impact of dropout varies significantly between adaptation strategies. When using the complete adaptation set, increased dropout leads to substantial performance degradation, with DER rising from 23.21% to 31.67% at 0.3 dropout and further deteriorating to 32.35% at 0.5 dropout. However, with restricted adaptation, moderate dropout (0.3) yields notable improvements, reducing DER to 19.74% and enhancing speaker counting accuracy from 67.3% to 73.2%. This dichotomy suggests that the effectiveness of dropout for GeMAPS features is highly dependent on the adaptation data distribution, with better results achieved when maintaining speaker count consistency between training and adaptation phases.

Across all feature configurations, the experiments reveal a consistent pattern: moderate dropout (0.3) combined with restricted adaptation provides the most stable and effective regularization strategy. This combination helps mitigate overfitting while maintaining the model’s ability to capture speaker-specific characteristics, particularly beneficial for speaker counting and turn detection tasks. However, the results also emphasize the importance of careful dropout rate selection, as excessive regularization (0.5) consistently leads to performance degradation, particularly affecting the system’s ability to detect overlapped speech segments.

Class-Balanced Learning

To address the inherent class imbalance of the number of speakers in the adaptation set, where conversations with higher speaker counts are underrepresented, we implemented a weighted batch sampling strategy. This approach adjusts the sampling probability inversely proportional to the class frequency, aiming to provide more balanced exposure to different conversation configurations during adaptation.
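A minimal sketch of such inverse-frequency sampling is shown below. The speaker-count distribution is a toy example and the helper name is hypothetical; the paper's actual batching code is not reproduced here.

```python
from collections import Counter

import numpy as np


def inverse_frequency_weights(speaker_counts: list) -> np.ndarray:
    """Sampling probability per recording, inversely proportional to how
    often its speaker count occurs in the adaptation set."""
    frequency = Counter(speaker_counts)
    weights = np.array([1.0 / frequency[count] for count in speaker_counts])
    return weights / weights.sum()


# Toy adaptation set: six 2-speaker, three 3-speaker, one 4-speaker recording.
counts = [2] * 6 + [3] * 3 + [4] * 1
probabilities = inverse_frequency_weights(counts)

# Draw the recording indices of one batch with replacement.
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(counts), size=4, replace=True, p=probabilities)
```

With these weights each speaker-count class receives the same total sampling probability, so the single 4-speaker recording is drawn as often as all six 2-speaker recordings combined. This also illustrates a possible drawback: rare recordings are repeated many times, which may encourage overfitting to them.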

Figure 8 illustrates the impact of weighted sampling on speaker counting performance. The results reveal distinct patterns across feature configurations. For the baseline MFB features with complete adaptation (CH1 all), weighted sampling maintains comparable speaker counting performance, with accuracy decreasing marginally from 73.6% to 72.4%. However, when applied to restricted adaptation, it leads to a more notable degradation in accuracy (74.4% to 70.4%), suggesting that the weighted approach might disrupt the learning of speaker characteristics when the adaptation set is more homogeneous.

The ECAPA-TDNN ⊕ MFB configuration demonstrates contrasting behavior between adaptation strategies. With complete adaptation, weighted sampling significantly degrades speaker count accuracy from 64.0% to 56.3%. Conversely, under restricted adaptation, it shows remarkable stability, maintaining performance (68.4% to 69.6%) while improving SCE from 0.376 to 0.360. This suggests that the effectiveness of weighted sampling for this feature combination is highly dependent on the adaptation data distribution.

As shown in Table 11, Table 12 and Table 13, the impact of weighted sampling extends beyond speaker counting metrics. For MFB features with restricted adaptation, while speaker counting performance decreases, there is a notable improvement in speaker turn detection precision (62.63% to 62.24%) and recall (47.51% to 46.98%). This suggests that despite the apparent degradation in speaker counting, the model might be learning more robust speaker-specific characteristics.

The GeMAPS ⊕ MFB configuration exhibits the most pronounced benefits from weighted sampling under restricted adaptation. Speaker count accuracy improves from 67.3% to 72.4%, while speaker turn detection shows substantial enhancement, with the F1-score increasing from 47.09% to 51.63%. However, this comes at the cost of increased missed speech (Miss rate increasing from 9.6% to 12.3%), indicating a potential trade-off between speaker discrimination and speech detection capabilities.

Across all feature configurations, weighted sampling demonstrates a clear interaction with the adaptation strategy. When applied to the complete adaptation set (CH1 all), it generally leads to performance degradation, particularly affecting DER through increased missed speech. However, when combined with restricted adaptation (CH1 2–4 spk), it shows potential benefits, especially for feature configurations that combine acoustic and speaker-specific information (ECAPA-TDNN and GeMAPS). This suggests that weighted sampling might be most effective when the adaptation data distribution is more controlled, allowing the model to learn from a balanced set of examples without being overwhelmed by the complexity of higher speaker counts.

Label Smoothing with Enhanced Temporal Resolution

The effectiveness of Gaussian kernel-based label smoothing demonstrates a strong dependence on the temporal resolution used during adaptation, as illustrated in Figure 9. The interaction between these factors varies significantly across feature configurations, revealing important patterns in system behavior.
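As an illustration of the general idea, the sketch below smooths binary frame-level speaker activity labels with a normalized Gaussian kernel along the time axis. The kernel width (`sigma`), its radius, and the edge handling are assumptions chosen for the example, not the settings used in the experiments.

```python
import numpy as np


def gaussian_smooth_labels(labels: np.ndarray, sigma: float = 1.0, radius: int = 3) -> np.ndarray:
    """Smooth binary labels of shape (T, S) along time with a Gaussian kernel.

    The kernel is normalized to sum to one, so outputs stay within [0, 1].
    Sequence edges are padded by repetition (an assumption for this sketch).
    """
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(labels.astype(float), ((radius, radius), (0, 0)), mode="edge")
    columns = [
        np.convolve(padded[:, speaker], kernel, mode="valid")
        for speaker in range(labels.shape[1])
    ]
    return np.stack(columns, axis=1)


# Toy labels: two speakers with a short overlap between frames 10 and 11.
labels = np.zeros((20, 2), dtype=int)
labels[5:12, 0] = 1
labels[10:18, 1] = 1
smoothed = gaussian_smooth_labels(labels, sigma=1.0, radius=3)
```

Frames well inside a segment keep a target close to 1, while frames near a speaker change receive intermediate targets. The number of frames affected by the smoothing is fixed, so its extent in seconds depends on the temporal resolution of the labels, which is one reason the subsampling factor matters here.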

The baseline MFB features (Table 14) show that while label smoothing alone offers limited benefits, its combination with reduced subsampling yields notable improvements. The DER decreases from 16.8% to 16.1%, with concurrent enhancements in speaker turn detection performance (F1-score increasing from 53.16% to 54.3%). The improvement manifests primarily through reduced missed speech (from 9.0% to 8.7%) and enhanced speaker boundary precision. The 250 ms collar is a key aspect in this context, as the label smoothing performed with either a subsampling factor of 10 or 5 remains within the margin covered by the collar during inference.

For ECAPA-TDNN ⊕ MFB features (Table 15), temporal resolution proves crucial for effective label smoothing. Standard subsampling with label smoothing leads to performance degradation, increasing DER from 20.95% to 24.87%. However, when combined with reduced subsampling, the system achieves improved speaker discrimination, reducing DER to 20.1% and enhancing turn detection precision from 56.20% to 57.5%. This pattern underscores the importance of maintaining sufficient temporal granularity for capturing subtle speaker transitions.

The GeMAPS ⊕ MFB configuration (Table 16) exhibits the most pronounced synergistic effects. The combination of label smoothing and reduced subsampling significantly improves missed speech detection (from 10.0% to 6.8%) while maintaining stable performance in speaker turn detection and overlap speech metrics. The reduction in DER from 23.21% to 21.1% demonstrates the effectiveness of this combined approach.

These findings establish a clear relationship between label smoothing effectiveness and temporal resolution in the adaptation process. The consistent improvements observed with reduced subsampling suggest that finer temporal granularity enables better utilization of the smoother decision boundaries created by label smoothing, particularly in challenging scenarios involving rapid speaker transitions and overlapped speech.

Adjusting Subsampling Rate

To systematically investigate the impact of temporal resolution on diarization performance, we conducted experiments reducing the subsampling factor from the conventional 10 to 5 across all feature configurations. This modification effectively doubles the temporal resolution during adaptation, potentially allowing the model to capture more precise speaker transitions and overlapped speech patterns.
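The effect of the subsampling factor on temporal resolution can be sketched as follows. The 10 ms frame shift is an assumption (it is not stated in this section), and simple decimation is used as the subsampling operation for illustration.

```python
import numpy as np

FRAME_SHIFT_MS = 10  # assumed feature frame shift; not stated in this section


def subsample(features: np.ndarray, labels: np.ndarray, factor: int) -> tuple:
    """Keep every `factor`-th frame of the features and of their labels."""
    return features[::factor], labels[::factor]


def output_resolution_ms(factor: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Time step between consecutive model outputs after subsampling."""
    return factor * frame_shift_ms


features = np.zeros((1000, 345))        # 10 s of spliced features at the assumed frame shift
labels = np.zeros((1000, 4), dtype=int)

coarse_features, coarse_labels = subsample(features, labels, factor=10)
fine_features, fine_labels = subsample(features, labels, factor=5)
print(coarse_features.shape[0], fine_features.shape[0])
```

Under these assumptions, a factor of 10 gives one decision every 100 ms and a factor of 5 gives one every 50 ms. The model therefore processes twice as many frames per recording, which increases sequence length and computation accordingly.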

Figure 10 illustrates the performance impact of enhanced temporal resolution across feature configurations, with all systems using consistent regularization (dropout = 0.3) and adaptation strategy (CH1 2–4 spk). For the baseline MFB features, temporal resolution enhancement yields modest improvements, with DER decreasing from 16.17% to 15.76%, a relative improvement of 2.5%. The ECAPA-TDNN ⊕ MFB configuration demonstrates more substantial gains, with DER decreasing from 15.8% to 14.89%, a relative improvement of 5.8%. Similarly, the GeMAPS ⊕ MFB configuration shows a 4.4% relative improvement, with DER decreasing from 19.74% to 18.87%. The consistent DER improvement across all configurations suggests that enhanced temporal resolution provides inherent benefits to the diarization process. Notably, the ECAPA-TDNN ⊕ MFB configuration achieves the lowest overall DER of 14.89% in our experiments, establishing a new performance benchmark. This result underscores the importance of temporal resolution as a key factor in adaptation strategy optimization. The impact of enhanced temporal resolution on speaker turn detection performance varies across configurations. While the MFB features show minimal change in F1-score (55.26% to 55.14%), the ECAPA-TDNN ⊕ MFB configuration demonstrates a more substantial improvement from 48.89% to 50.53%. This 3.4% relative enhancement in boundary detection precision is particularly significant as it indicates improved capability to identify precise speaker transitions, a critical aspect of diarization quality that might be obscured in the overall DER metric.

Table 17 presents a detailed analysis of error components for systems with enhanced temporal resolution. The ECAPA-TDNN ⊕ MFB configuration demonstrates a well-balanced error distribution, with missed speech at 5.7%, false alarms at 3.7%, and speaker confusion error at 5.5%. In contrast, the GeMAPS ⊕ MFB configuration exhibits a distinct error pattern with substantially higher missed speech (12.3%) but lower false alarms (1.6%) and speaker confusion (5.0%). This suggests that the enhanced temporal resolution interacts differently with each feature representation’s inherent characteristics, affecting the system’s detection thresholds in distinctive ways.

When considering our comprehensive enhancement strategy—combining restricted adaptation, moderate dropout regularization, and enhanced temporal resolution—the ECAPA-TDNN ⊕ MFB configuration achieves a remarkable 29.0% relative improvement over the configuration without enhancements (20.95% to 14.89%). More importantly, it surpasses the real baseline MFB system trained with CH1-all by 11.4% relative (16.8% to 14.89%). This substantial performance gain demonstrates the effectiveness of our adaptation optimization approach and establishes a new benchmark for end-to-end neural diarization on conversational speech.

4.3. Optimal Feature Configurations Analysis

Based on our extensive experimentation, we can now identify the optimal configurations for each feature type and their respective strengths and weaknesses. Table 18 summarizes the best-performing configuration for each feature type across multiple evaluation metrics.

MFB Features

The baseline MFB features achieve a DER of 15.76% when optimized with our adaptation strategy. Their primary strength lies in speaker counting accuracy (74.8%) and speaker turn detection (F1-score of 55.14%), outperforming both alternative configurations in these metrics. MFB features also demonstrate superior performance in overlap speech detection with an F1-score of 52.8%. The error component analysis reveals a balanced profile with a moderate miss rate (8.9%) and a low false alarm rate (2.1%). As shown in Table 19, MFB features show a modest but consistent cumulative improvement of 6.2% relative DER reduction from baseline to fully optimized configuration. The most significant gain comes from restricting the adaptation set to CH1 2–4 speakers (6.9% improvement), while dropout regularization actually causes a slight performance degradation for this feature type. This suggests that MFB features, being less specialized than the other configurations, benefit primarily from adaptation data that better matches the evaluation scenarios rather than from additional regularization techniques.

ECAPA-TDNN ⊕ MFB Features

This configuration achieves the best overall DER of 14.89%, representing an 11.4% relative improvement over the baseline MFB system adapted on the complete CH1 dataset. Its primary advantage comes from a significantly lower miss rate (5.7%) compared to other configurations, indicating superior capability in detecting speech segments. However, this comes at the cost of a higher false alarm rate (3.7%) and slightly lower speaker counting accuracy (68.8%). The configuration demonstrates reasonable performance in speaker turn detection (F1-score of 50.53%) and overlap speech detection (F1-score of 47.0%). The error profile suggests that ECAPA-TDNN embeddings contribute particularly to improved speech activity detection and speaker discrimination in overlapped regions. The cumulative impact of our enhancement techniques is most pronounced for the ECAPA-TDNN ⊕ MFB configuration, with a remarkable 29.0% relative improvement from baseline to fully optimized settings. This configuration benefits substantially from every enhancement step: restricted adaptation provides an 18.4% improvement, increased dropout adds another 7.5%, and enhanced temporal resolution contributes a further 5.8%. This progressive improvement highlights the synergistic effect of our adaptation optimizations with speaker-discriminative embeddings, allowing the system to fully leverage ECAPA-TDNN’s specialized capabilities while mitigating its potential overfitting tendencies.
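Note that the per-step relative improvements do not add up to the cumulative figure, because each step is measured relative to the DER reached by the previous step. The sketch below recomputes the steps from the DER values quoted in the text and shows that they compound multiplicatively; the function name is illustrative.

```python
def relative_improvement(baseline_der: float, new_der: float) -> float:
    """Relative DER reduction in percent."""
    return 100.0 * (baseline_der - new_der) / baseline_der


# DER values quoted in the text for ECAPA-TDNN + MFB:
# CH1 all -> CH1 2-4 -> + dropout 0.3 -> + subsampling factor 5.
ders = [20.95, 17.09, 15.8, 14.89]
steps = [relative_improvement(a, b) for a, b in zip(ders, ders[1:])]
overall = relative_improvement(ders[0], ders[-1])

# Compose the steps: multiply the fraction of DER retained after each one.
retained = 1.0
for step in steps:
    retained *= 1.0 - step / 100.0
composed = 100.0 * (1.0 - retained)

print([round(step, 1) for step in steps], round(overall, 1), round(composed, 1))
```

The three steps come out at roughly 18.4%, 7.5%, and 5.8%, and their composition matches the cumulative reduction of about 29%, whereas their plain sum would overstate it.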

GeMAPS ⊕ MFB Features

The GeMAPS ⊕ MFB configuration achieves a DER of 18.87% when optimized. While this is higher than the other configurations, it demonstrates unique strengths in specific components. It achieves the lowest false alarm rate (1.6%) and speaker error rate (5.0%), indicating high precision in speech segment boundaries and speaker attribution. The configuration also performs well in speaker counting accuracy (73.2%) and speaker turn detection (F1-score of 52.65%), approaching the performance of MFB features. However, its substantially higher miss rate (12.3%) and lower overlap speech detection performance (F1-score of 41.5%) prevent it from achieving a competitive overall DER. This error profile suggests that GeMAPS features are particularly valuable for precise speaker attribution but may be less sensitive to detecting all speech regions. For GeMAPS ⊕ MFB features, the cumulative improvement of 18.7% follows an interesting pattern. The initial restricted adaptation provides only a modest 3.1% improvement, but the addition of dropout regularization yields a substantial 12.3% gain, the largest improvement from dropout observed across all configurations. This suggests that GeMAPS features, with their specialized prosodic information, are particularly prone to overfitting and benefit significantly from appropriate regularization techniques. The enhanced temporal resolution contributes an additional 4.4% improvement, confirming the value of finer-grained analysis for capturing the subtle prosodic variations that GeMAPS features represent.

4.4. Comparison with State-of-the-Art Systems

To provide a comprehensive evaluation of our proposed approaches, we compared our best-performing systems against several state-of-the-art diarization frameworks. This comparison includes both traditional clustering-based methods (VBx, AHC) and newer neural approaches (Pyannote). Additionally, we investigated the impact of temporal resolution by evaluating our systems with different subsampling rates, which offers important insights into the trade-off between computational efficiency and diarization accuracy.

Table 20 presents a comparison of our systems against state-of-the-art approaches. Several key observations emerge from these results:

First, the impact of temporal resolution (controlled by subsampling rate and median filter length) varies significantly across feature configurations. During inference, a median filter is applied to smooth the output predictions. The standard implementation uses an 11-frame median filter when employing a subsampling factor of 10. The length of this filter must be adjusted proportionally to the subsampling rate; therefore, when using a subsampling factor of 5, the median filter length is reduced to 5 frames.

For ECAPA-TDNN ⊕ MFB, finer temporal precision (subsampling = 5, median = 5) substantially improves collar-free performance, reducing DER from 25.78% to 22.69%, a remarkable 12.0% relative improvement. This suggests that speaker-discriminative embeddings particularly benefit from finer temporal granularity for precise boundary detection. Similarly, GeMAPS ⊕ MFB shows a 7.7% relative improvement in collar-free evaluation when using finer temporal precision. In contrast, for MFB features, while collar-free performance improves with enhanced temporal resolution (26.00% to 24.21%), collared performance actually degrades (15.64% to 17.03%). This indicates that the benefit of enhanced resolution for MFB features manifests primarily at speaker boundaries, but may introduce additional errors in broader segment attribution.

Second, when comparing against established diarization approaches, our ECAPA-TDNN ⊕ MFB configuration with enhanced temporal resolution (22.69% DER without collar) outperforms both AHC (25.61%) and Pyannote (29.30%), while approaching the performance of VBx (21.77%). Notably, the VBx system incorporates specialized x-vector embeddings and Bayesian HMM refinement, making it a highly optimized clustering-based approach. The competitive performance of our end-to-end neural system demonstrates the effectiveness of our adaptation strategy and feature integration approach.

For collared evaluation, our ECAPA-TDNN ⊕ MFB with standard temporal resolution achieves 14.89% DER, which is competitive with VBx (14.21%) and substantially better than AHC (17.64%). This suggests that our approach effectively captures the broader speaker attribution patterns while maintaining reasonable boundary precision. The consistent performance advantage of our ECAPA-TDNN ⊕ MFB configuration, particularly with enhanced temporal resolution, highlights the value of combining acoustic features with specialized speaker embeddings. Furthermore, the dramatic impact of temporal resolution adjustments, which come with minimal computational overhead, demonstrates that significant performance gains can be achieved through careful parameter tuning without increasing model complexity.

These findings establish our adapted ECAPA-TDNN ⊕ MFB with enhanced temporal resolution as a highly competitive approach for speaker diarization, capable of rivaling and, in certain metrics, surpassing specialized clustering-based methods while maintaining the adaptability and end-to-end nature of neural diarization frameworks. While VBx currently maintains a slight performance edge, our end-to-end approach provides several practical advantages: it offers a unified architecture that can be jointly optimized, enables more straightforward adaptation to new domains, and inherently handles overlapped speech, a known challenge for clustering-based methods. These operational benefits, combined with competitive performance metrics, demonstrate that our approach represents a promising direction for practical diarization applications.
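The median filtering step mentioned in the first observation can be sketched with NumPy alone. The implementation below zero-pads the sequence edges and requires an odd kernel length; both choices are assumptions made for this example rather than details confirmed by the paper.

```python
import numpy as np


def median_filter(decisions: np.ndarray, kernel_size: int) -> np.ndarray:
    """Apply a median filter along time to binary decisions of shape (T, S).

    The kernel length must be odd. Sequence edges are zero-padded
    (an assumption for this sketch).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    padded = np.pad(decisions, ((half, half), (0, 0)), mode="constant")
    # Resulting shape is (T, S, kernel_size): one window per frame and speaker.
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size, axis=0)
    return np.median(windows, axis=-1).astype(decisions.dtype)


# Toy output for one speaker: an isolated one-frame spike and a ten-frame segment.
decisions = np.zeros((30, 1), dtype=int)
decisions[10, 0] = 1
decisions[15:25, 0] = 1
filtered = median_filter(decisions, kernel_size=5)
```

In this toy example the isolated spike is removed while the longer segment keeps its boundaries. A filter of a given length in frames spans half as much time when the subsampling factor is halved, which is why the two settings are adjusted together.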

5. Conclusions

This research presented a thorough investigation into feature integration strategies for end-to-end neural speaker diarization, focusing on conversational telephone speech. Our most significant achievement is demonstrating that carefully optimized end-to-end neural diarization systems can achieve performance comparable to state-of-the-art clustering-based approaches like VBx, while offering crucial advantages in adaptability, unified architecture, and inherent handling of overlapped speech. This represents a major advancement in neural diarization technology, as end-to-end systems have traditionally lagged behind multi-stage clustering approaches in benchmark performance.

The ECAPA-TDNN speaker embeddings, when combined with Mel-filterbank features, achieved the best overall performance with a 29% relative reduction in Diarization Error Rate (DER) through systematic adaptation strategy optimization. It is important to note that for this configuration, Oracle VAD was employed during both training and inference phases, addressing the known limitations of ECAPA embeddings in representing non-speech regions. For fair comparison, similar Oracle VAD conditions were applied to all other feature configurations in the two-speaker scenario, ensuring consistent evaluation conditions across all experiments. Despite these controlled conditions, the ECAPA-TDNN with MFB configuration demonstrated superior capability in detecting speech segments with the lowest miss rate among all tested configurations.

GeMAPS paralinguistic features showed distinct strengths in speaker attribution precision and boundary detection, achieving the lowest false alarm and speaker confusion rates. However, their higher miss rate limited overall DER performance. The study also identified a reduced optimal set of 52 GeMAPS parameters that maintained performance while reducing computational complexity.

The research revealed that adaptation strategies significantly impact performance, with restricted adaptation to similar speaker counts providing consistent benefits across all feature configurations. This finding challenges conventional wisdom [36] about adaptation to the complete range of potential speakers.

Enhanced temporal resolution through reduced subsampling proved highly beneficial, particularly for speaker-discriminative features. For ECAPA-TDNN with MFB features, this approach yielded a 12% relative improvement in collar-free evaluation, highlighting the importance of precise temporal analysis in diarization.
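The relation between subsampling and temporal resolution can be sketched as follows. The snippet keeps every k-th frame of a feature sequence and reports the resulting output frame period. The 10 ms frame shift is an assumption (a common front-end setting, not stated in this section), the helper names are hypothetical, and the real system may also splice context frames before subsampling; only the factors 10 and 5 come from the appendix tables.

```python
def subsample(frames, factor):
    """Keep every `factor`-th frame, starting from the first one."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return frames[::factor]


def output_resolution_ms(frame_shift_ms, factor):
    """Time step between consecutive model outputs after subsampling."""
    return frame_shift_ms * factor


FRAME_SHIFT_MS = 10          # assumed frame shift, not taken from the paper
frames = list(range(1000))   # 1000 frames = 10 s of audio at a 10 ms shift

standard = subsample(frames, 10)  # standard temporal resolution
enhanced = subsample(frames, 5)   # enhanced temporal resolution

print(len(standard), output_resolution_ms(FRAME_SHIFT_MS, 10))
print(len(enhanced), output_resolution_ms(FRAME_SHIFT_MS, 5))
```

Halving the subsampling factor doubles the number of frames the encoder processes, which is where the additional cost of the finer resolution comes from.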

The optimized ECAPA-TDNN with MFB configuration achieved competitive performance against state-of-the-art systems, including specialized clustering-based methods, while maintaining the adaptability advantages of end-to-end neural approaches.

These findings advance understanding of feature integration in neural diarization and provide practical guidelines for selecting and optimizing acoustic representations based on specific application requirements in telephone speech scenarios.

Author Contributions

Conceptualization, J.I.A.-T.; methodology, J.I.A.-T.; software, J.I.A.-T. and A.L.-D.; validation, J.I.A.-T.; formal analysis, J.I.A.-T.; investigation, J.I.A.-T.; data curation, J.I.A.-T.; writing—original draft, J.I.A.-T.; writing—review, J.I.A.-T.; editing, J.I.A.-T.; visualization, J.I.A.-T.; supervision, A.L.-D. and D.R.; project administration, A.L.-D. and D.R.; funding acquisition, A.L.-D. and D.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by PID2021-125943OB-I00, MCIN/AEI/10.13039/501100011033/FEDER, UE from the Spanish Ministerio de Ciencia e Innovacion y Agencia del Fondo Europeo de Desarrollo Regional and project SI4/PJI/2024-00237 (COSER-IA), Comunidad de Madrid.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from https://catalog.ldc.upenn.edu/LDC2001S97.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix presents the complete results from our experimental analysis across all three feature configurations (MFB, ECAPA-TDNN ⊕ MFB, and GeMAPS ⊕ MFB) in conjunction with the various regularization techniques discussed throughout the main document. These detailed findings include performance metrics for different dropout rates, adaptation strategies, temporal resolution adjustments, and their corresponding impact on multiple evaluation criteria. The comprehensive data provided here serves as supplementary evidence for the conclusions drawn in the primary analysis and offers additional insights into the effectiveness of each configuration under different experimental conditions.

Appendix A.1. Full Results for Mel-Filterbanks

Table A1. Performance comparison of EEND-EDA with Mel-Filterbanks using different adaptation sets (CH1 all and CH1 2–4 spk) and subsampling rates. DER components are reported alongside Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
CH1 all	5	8.9	2.5	5.1	16.5	0.342	0.738	0.630	0.470	0.538	0.650	0.430	0.518
CH1 2–4 spk	5	8.3	2.6	5.1	16.0	0.315	0.748	0.635	0.480	0.547	0.648	0.450	0.530
CH1 all	10	9.0	2.6	5.2	16.8	0.348	0.736	0.624	0.463	0.532	0.649	0.424	0.513
CH1 2–4 spk	10	8.0	2.7	4.9	15.6	0.320	0.744	0.626	0.475	0.540	0.644	0.445	0.526

Table A2. Performance comparison of different augmentation techniques applied to Mel-Filterbanks features using CH1 all adaptation set. The table presents the effects of varying Dropout rates, Weighted Sampling, and Label Smoothing at different subsampling rates on DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	9.1	2.5	5.3	16.9	0.332	0.732	0.619	0.481	0.541	0.656	0.420	0.512
Dropout = 0.5		13.2	1.4	7.0	21.6	0.436	0.660	0.602	0.451	0.516	0.694	0.194	0.304
Weighted Sampling		7.7	3.4	5.8	16.9	0.343	0.724	0.579	0.453	0.508	0.620	0.427	0.506
Label Smoothing		9.3	2.7	5.6	17.6	0.384	0.716	0.614	0.448	0.519	0.642	0.432	0.517
Dropout = 0.3	5	8.5	2.1	5.2	15.9	0.304	0.756	0.620	0.482	0.542	0.651	0.420	0.511
Dropout = 0.5		10.4	2.6	5.9	18.9	0.400	0.640	0.602	0.476	0.532	0.641	0.298	0.407
Weighted Sampling		11.1	2.2	5.8	19.1	0.356	0.704	0.622	0.470	0.535	0.651	0.314	0.423
Label Smoothing		8.7	2.4	5.0	16.1	0.335	0.745	0.635	0.475	0.543	0.655	0.435	0.523

Table A3. Performance comparison of different augmentation techniques applied to Mel-Filterbanks features using CH1 2–4 spk adaptation set. The table presents the effects of varying Dropout rates, Weighted Sampling, and Label Smoothing at different subsampling rates on DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	9.0	2.2	5.0	16.2	0.312	0.748	0.627	0.494	0.553	0.665	0.420	0.515
Dropout = 0.5		11.4	2.4	5.7	19.6	0.364	0.680	0.594	0.477	0.529	0.618	0.259	0.365
Weighted Sampling		7.8	3.0	5.3	16.1	0.316	0.744	0.622	0.470	0.535	0.628	0.457	0.529
Label Smoothing		7.8	4.2	5.4	17.4	0.332	0.724	0.606	0.447	0.514	0.560	0.470	0.514
Dropout = 0.3	5	8.9	2.1	4.7	15.8	0.312	0.720	0.628	0.493	0.551	0.663	0.431	0.522
Dropout = 0.5		11.1	2.5	5.6	19.1	0.404	0.648	0.600	0.482	0.534	0.640	0.295	0.404
Weighted Sampling		8.0	2.7	5.5	16.1	0.312	0.760	0.630	0.472	0.540	0.645	0.447	0.528
Label Smoothing		8.0	2.8	5.7	16.4	0.340	0.736	0.625	0.470	0.536	0.631	0.449	0.525

Appendix A.2. Full Results for ECAPA-TDNN ⊕ Mel-Filterbanks

Table A4. Performance comparison of EEND-EDA with ECAPA-TDNN concatenated with Mel-Filterbanks using different adaptation sets (CH1 all and CH1 2–4 spk) and subsampling rates. Results show DER components alongside Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
CH1 all	5	5.0	8.2	7.4	20.6	0.438	0.648	0.570	0.388	0.462	0.432	0.460	0.445
CH1 2–4 spk	5	5.2	6.2	5.4	16.8	0.370	0.690	0.605	0.410	0.490	0.455	0.475	0.465
CH1 all	10	5.0	8.4	7.6	21.0	0.444	0.640	0.562	0.380	0.453	0.425	0.456	0.440
CH1 2–4 spk	10	5.3	6.3	5.5	17.1	0.376	0.684	0.598	0.405	0.483	0.452	0.470	0.461

Table A5. Performance evaluation of ECAPA-TDNN concatenated with Mel-Filterbanks features using CH1 all speakers adaptation set. Results compare the impact of different regularization techniques (Dropout, Weighted Sampling, and Label Smoothing) across multiple subsampling rates on DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	6.1	6.3	8.0	20.5	0.384	0.700	0.589	0.387	0.467	0.433	0.451	0.442
Dropout = 0.5		6.0	6.4	11.0	23.4	0.592	0.572	0.553	0.361	0.437	0.432	0.285	0.344
Weighted Sampling		8.2	7.7	8.0	23.9	0.476	0.563	0.555	0.369	0.443	0.405	0.446	0.425
Label Smoothing		10.2	5.2	9.5	24.9	0.548	0.588	0.542	0.348	0.424	0.442	0.371	0.403
Dropout = 0.3	5	9.5	3.5	8.9	21.9	0.552	0.576	0.603	0.380	0.466	0.496	0.371	0.425
Dropout = 0.5		6.5	4.8	10.9	22.2	0.572	0.568	0.580	0.366	0.449	0.446	0.390	0.416
Weighted Sampling		8.1	6.9	8.9	23.9	0.428	0.668	0.544	0.358	0.432	0.392	0.341	0.365
Label Smoothing		4.8	8.0	7.3	20.1	0.432	0.652	0.575	0.392	0.465	0.435	0.465	0.450

Table A6. Performance evaluation of ECAPA-TDNN concatenated with Mel-Filterbanks features using CH1 2–4 speakers adaptation set. The table presents a comparison of different regularization techniques (Dropout, Weighted Sampling, and Label Smoothing) at varying subsampling rates, measuring their impact on DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	5.7	4.0	6.0	15.8	0.384	0.688	0.604	0.411	0.489	0.505	0.413	0.454
Dropout = 0.5		6.0	5.4	8.3	19.7	0.464	0.628	0.571	0.382	0.457	0.443	0.374	0.406
Weighted Sampling		5.3	6.2	5.3	16.9	0.360	0.696	0.599	0.406	0.484	0.455	0.475	0.465
Label Smoothing		6.9	7.4	5.7	20.0	0.380	0.688	0.587	0.382	0.463	0.423	0.434	0.428
Dropout = 0.3	5	5.7	3.7	5.5	14.9	0.392	0.676	0.613	0.430	0.505	0.525	0.413	0.462
Dropout = 0.5		5.9	4.7	6.8	17.4	0.392	0.672	0.587	0.398	0.475	0.487	0.389	0.432
Weighted Sampling		5.6	5.0	5.5	16.2	0.388	0.688	0.606	0.414	0.492	0.491	0.448	0.468
Label Smoothing		5.1	6.0	5.3	16.4	0.365	0.695	0.610	0.415	0.495	0.460	0.480	0.470

Appendix A.3. Full Results for GeMAPS ⊕ Mel-Filterbanks

Table A7. Performance comparison of EEND-EDA with GeMAPS concatenated with Mel-Filterbanks using different adaptation sets (CH1 all and CH1 2–4 spk) and subsampling rates. Results are reported for DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
CH1 all	5	7.2	6.9	7.8	21.9	0.452	0.632	0.548	0.372	0.443	0.415	0.442	0.428
CH1 2–4 spk	5	11.8	3.0	6.0	20.8	0.358	0.715	0.605	0.458	0.522	0.605	0.310	0.410
CH1 all	10	10.0	5.2	8.0	23.2	0.428	0.664	0.546	0.432	0.482	0.558	0.363	0.440
CH1 2–4 spk	10	9.6	4.8	8.1	22.5	0.415	0.673	0.534	0.421	0.471	0.562	0.379	0.453

Table A8. Performance evaluation of GeMAPS concatenated with Mel-Filterbanks features using CH1 all speakers adaptation set. The table compares various regularization techniques (Dropout, Weighted Sampling, and Label Smoothing) across different subsampling rates, showing their impact on DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	13.8	9.3	8.6	31.7	0.524	0.592	0.568	0.435	0.493	0.384	0.336	0.359
Dropout = 0.5		22.2	2.5	7.7	32.4	0.636	0.500	0.516	0.382	0.439	0.530	0.151	0.234
Weighted Sampling		17.6	3.0	7.8	28.4	0.568	0.576	0.548	0.406	0.466	0.553	0.261	0.355
Label Smoothing		10.6	4.9	8.3	23.8	0.396	0.688	0.538	0.428	0.477	0.541	0.382	0.448
Dropout = 0.3	5	20.0	1.4	5.7	27.2	0.492	0.624	0.573	0.429	0.490	0.686	0.229	0.344
Dropout = 0.5		15.6	24.8	7.3	47.8	0.828	0.348	0.487	0.422	0.452	0.270	0.368	0.311
Weighted Sampling		15.5	2.6	7.2	25.4	0.480	0.636	0.563	0.420	0.481	0.606	0.297	0.398
Label Smoothing		14.5	2.3	7.5	24.2	0.464	0.648	0.567	0.420	0.483	0.624	0.285	0.391

Table A9. Performance evaluation of GeMAPS concatenated with Mel-Filterbanks features using CH1 2–4 speakers adaptation set. Results demonstrate the effectiveness of different regularization techniques (Dropout, Weighted Sampling, and Label Smoothing) at various subsampling rates by measuring DER components, Speaker Counting, Speaker Turn Detection, and Overlap Speech Detection metrics.

Modification	Adapt Subsampling	DER Components (%)				Speaker Counting		Speaker Turn Detection			Overlap Speech Detection
Modification	Adapt Subsampling	Miss	FA	Spk. Error	Total	SCE	Accuracy	Precision	Recall	F1	Precision	Recall	F1
Dropout = 0.3	10	12.3	2.2	5.3	19.7	0.328	0.732	0.606	0.471	0.530	0.662	0.299	0.412
Dropout = 0.5		14.1	2.3	6.4	22.8	0.396	0.652	0.557	0.460	0.504	0.610	0.216	0.319
Weighted Sampling		12.3	3.9	5.8	22.1	0.356	0.724	0.600	0.453	0.516	0.552	0.339	0.420
Label Smoothing		11.4	3.3	5.8	20.5	0.356	0.724	0.599	0.450	0.514	0.599	0.335	0.429
Dropout = 0.3	5	12.3	1.6	5.0	18.9	0.320	0.732	0.609	0.478	0.535	0.705	0.285	0.406
Dropout = 0.5		14.9	2.0	6.0	22.9	0.436	0.616	0.551	0.455	0.499	0.650	0.224	0.334
Weighted Sampling		11.6	2.0	6.1	19.8	0.356	0.724	0.600	0.455	0.518	0.664	0.305	0.418
Label Smoothing		11.6	2.8	5.9	20.3	0.352	0.720	0.609	0.464	0.527	0.610	0.315	0.415

References

1. Raj, D.; Denisov, P.; Chen, Z.; Erdogan, H.; Huang, Z.; He, M.; Watanabe, S.; Du, J.; Yoshioka, T.; Luo, Y.; et al. Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 897–904.
2. O’Shaughnessy, D. Speaker Diarization: A Review of Objectives and Methods. Appl. Sci. 2025, 15, 2002.
3. Gupta, A.; Purwar, A. Enhancing speaker diarization for audio-only systems using deep learning. In Applications of Artificial Intelligence, Big Data and Internet of Things in Sustainable Development; CRC Press: Boca Raton, FL, USA, 2022; pp. 65–79.
4. Menon, N.G.; Shrivastava, A.; Bhavana, N.D.; Simon, J. Deep Learning based Transcribing and Summarizing Clinical Conversations. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 11–13 November 2021; pp. 358–365.
5. O’Sullivan, J.; Bogaarts, G.; Schoenenberger, P.; Tillmann, J.; Slater, D.; Mesgarani, N.; Eule, E.; Kilchenmann, T.; Murtagh, L.; Hipp, J.; et al. Automatic speaker diarization for natural conversation analysis in autism clinical trials. Sci. Rep. 2023, 13, 10270.
6. Moattar, M.; Homayounpour, M. A review on speaker diarization systems and approaches. Speech Commun. 2012, 54, 1065–1103.
7. Kanda, N.; Xiao, X.; Gaur, Y.; Wang, X.; Meng, Z.; Chen, Z.; Yoshioka, T. Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 8082–8086.
8. Garcia-Romero, D.; Snyder, D.; Sell, G.; Povey, D.; McCree, A. Speaker diarization using deep neural network embeddings. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4930–4934.
9. Landini, F.; Profant, J.; Diez, M.; Burget, L. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks. Comput. Speech Lang. 2022, 71, 101254.
10. Chang, S.Y.; Li, B.; Simko, G.; Sainath, T.N.; Tripathi, A.; van den Oord, A.; Vinyals, O. Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5549–5553.
11. Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798.
12. Dawalatabad, N.; Ravanelli, M.; Grondin, F.; Thienpondt, J.; Desplanques, B.; Na, H. ECAPA-TDNN Embeddings for Speaker Diarization. Proc. Interspeech 2021, 3560–3564.
13. Raj, D.; Snyder, D.; Povey, D.; Khudanpur, S. Probing the Information Encoded in X-Vectors. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 726–733.
14. Kwon, Y.; Heo, H.S.; Jung, J.W.; Kim, Y.J.; Lee, B.J.; Chung, J.S. Multi-Scale Speaker Embedding-Based Graph Attention Networks for Speaker Diarisation. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 8367–8371.
15. Park, T.J.; Kanda, N.; Dimitriadis, D.; Han, K.J.; Watanabe, S.; Narayanan, S. A review of speaker diarization: Recent advances with deep learning. Comput. Speech Lang. 2022, 72, 101317.
16. Serafini, L.; Cornell, S.; Morrone, G.; Zovato, E.; Brutti, A.; Squartini, S. An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings. Comput. Speech Lang. 2023, 82, 101534.
17. Fujita, Y.; Kanda, N.; Horiguchi, S.; Nagamatsu, K.; Watanabe, S. End-to-End Neural Speaker Diarization with Permutation-Free Objectives. Proc. Interspeech 2019, 4300–4304.
18. Fujita, Y.; Kanda, N.; Horiguchi, S.; Xue, Y.; Nagamatsu, K.; Watanabe, S. End-to-End Neural Speaker Diarization with Self-Attention. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 296–303.
19. Horiguchi, S.; Fujita, Y.; Watanabe, S.; Xue, Y.; Nagamatsu, K. End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. Proc. Interspeech 2020, 269–273.
20. Xia, W.; Lu, H.; Wang, Q.; Tripathi, A.; Huang, Y.; Moreno, I.L.; Sak, H. Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 8077–8081.
21. Zhao, G.; Wang, Q.; Lu, H.; Huang, Y.; Moreno, I.L. Augmenting Transformer-Transducer Based Speaker Change Detection with Token-Level Training Loss. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
22. Kinoshita, K.; Delcroix, M.; Tawara, N. Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7198–7202.
23. Kinoshita, K.; Delcroix, M.; Tawara, N. Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. International Speech Communication Association, Brno, Czech Republic, 30 August–3 September 2021; Volume 4, pp. 2513–2517.
24. Sun, G.; Zhang, C.; Woodland, P.C. Combination of deep speaker embeddings for diarisation. Neural Netw. 2021, 141, 372–384.
25. Medennikov, I.; Korenevsky, M.; Prisyach, T.; Khokhlov, Y.; Korenevskaya, M.; Sorokin, I.; Timofeeva, T.; Mitrofanov, A.; Andrusenko, A.; Podluzhny, I.; et al. Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario. Proc. Interspeech 2020, 274–278.
26. Watanabe, S.; Mandel, M.; Barker, J.; Vincent, E.; Arora, A.; Chang, X.; Khudanpur, S.; Manohar, V.; Povey, D.; Raj, D.; et al. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. In Proceedings of the 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), Online, 4 May 2020.
27. Wang, W.; Li, M. Incorporating End-to-End Framework Into Target-Speaker Voice Activity Detection. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 8362–8366.
28. Wang, D.; Xiao, X.; Kanda, N.; Yoshioka, T.; Wu, J. Target Speaker Voice Activity Detection with Transformers and Its Integration with End-To-End Neural Diarization. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
29. Toyoshima, I.; Okada, Y.; Ishimaru, M.; Uchiyama, R.; Tada, M. Multi-input speech emotion recognition model using mel spectrogram and GeMAPS. Sensors 2023, 23, 1743.
30. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202.
31. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. Proc. Interspeech 2021, 3400–3404.
32. Atmaja, B.T.; Akagi, M. On the differences between song and speech emotion recognition: Effect of feature sets, feature types, and classifiers. In Proceedings of the 2020 IEEE Region 10 Conference (TENCON), Osaka, Japan, 16–19 November 2020; pp. 968–972.
33. Alvarez-Trejos, J.I.; Labrador, B.; Lozano-Diez, A. Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios. In Proceedings of the Odyssey 2024: The Speaker and Language Recognition Workshop, Quebec, QC, Canada, 18–21 June 2024.
34. Martin, A.; Przybocki, M. The NIST 1999 Speaker Recognition Evaluation–An Overview. Digit. Signal Process. 2000, 10, 1–18.
35. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Waikoloa, HI, USA, 11–15 December 2011.
36. Landini, F.; Lozano-Diez, A.; Diez, M.; Burget, L. From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization. Proc. Interspeech 2022, 5095–5099.
37. Godfrey, J.; Holliman, E.; McDaniel, J. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA, USA, 23–26 March 1992; Volume 1, pp. 517–520.
38. Przybocki, M.; Martin, A.F. NIST speaker recognition evaluation chronicles. In Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2004), Toledo, Spain, 1–3 June 2004; pp. 15–22.
39. Sadjadi, O.; Greenberg, C.; Singer, E.; Mason, L.; Reynolds, D. NIST 2021 Speaker Recognition Evaluation Plan, NIST SRE. 2021. Available online: https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=932697 (accessed on 25 April 2024).
40. Przybocki, M.A.; Martin, A.F.; Le, A.N. NIST Speaker Recognition Evaluation Chronicles–Part 2. In Proceedings of the 2006 IEEE Odyssey–The Speaker and Language Recognition Workshop, San Juan, PR, USA, 28–30 June 2006; pp. 1–6.
41. Martin, A.F.; Greenberg, C.S. NIST 2008 speaker recognition evaluation: Performance across telephone and room microphone channels. In Proceedings of the Tenth Annual Conference of the International Speech Communication Association, Brighton, UK, 6–10 September 2009.
42. Snyder, D.; Chen, G.; Povey, D. Musan: A music, speech, and noise corpus. arXiv 2015, arXiv:1510.08484.
43. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224.
44. Bredin, H. Pyannote.audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe. Proc. Interspeech 2023, 1983–1987.

Figure 1. EEND-EDA architecture overview showing alternative feature configurations. The model processes input audio through parallel streams: either standalone Mel-filterbanks ($X \in \mathbb{R}^{T \times M}$) or a concatenated representation ($F \in \mathbb{R}^{T \times (M + E)}$) combining Mel-filterbanks with auxiliary features ($B \in \mathbb{R}^{T \times E}$). The SA-EEND encoder generates embeddings ($e \in \mathbb{R}^{T \times D}$), which the EDA module transforms into speaker-specific attractors ($a_{s}$) for final diarization output.
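The frame-wise concatenation described in the Figure 1 caption can be illustrated with a minimal sketch. It joins a T × M Mel-filterbank matrix X with a T × E auxiliary matrix B into a T × (M + E) input F. The toy dimensions, the constant fill values, and the function name are assumptions for illustration only; the actual feature extraction and any normalization used in the paper are not reproduced here.

```python
def concatenate_features(mfb, aux):
    """Concatenate two frame-aligned feature matrices along the feature axis."""
    if len(mfb) != len(aux):
        raise ValueError("both feature streams must have the same number of frames")
    return [x_t + b_t for x_t, b_t in zip(mfb, aux)]


# Toy sizes (hypothetical): T frames, M filterbank bins, E auxiliary dimensions.
T, M, E = 4, 3, 2
X = [[0.0] * M for _ in range(T)]  # stands in for Mel-filterbank features
B = [[1.0] * E for _ in range(T)]  # stands in for ECAPA-TDNN or GeMAPS features

F = concatenate_features(X, B)
print(len(F), len(F[0]))  # number of frames, feature dimension M + E
```

Both streams must share the same frame rate before concatenation, so an auxiliary representation computed at a different rate would first need to be aligned to the filterbank frames.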

Figure 2. Distribution of audio duration in hours by number of speakers for CallHome training (CH1) and testing (CH2) datasets.

Figure 3. Results obtained on CH2-2spk dataset using different features. Shaded regions indicate experiments where the corresponding features were concatenated with MFB features. For each feature type, the bars show the DER with and without Oracle VAD, as well as with and without domain adaptation (striped bars indicate experiments with domain adaptation, solid bars without domain adaptation).

Figure 4. The degradation is measured as ΔDER (%) when computing the mean of each feature group per audio file during inference.

Figure 5. Performance comparison of different GeMAPS subsets and context sizes on the CH2 2-spk set, with adaptation on CH1 2-spk set. Three feature configurations are compared: Complete GeMAPS, F0 + Formants, and a Reduced Set, across varying context sizes from 0 to 8.

Figure 6. Performance comparison across feature configurations when adapting to different CH1 subsets. Radar plots show multiple evaluation metrics for: (a) MFB baseline, (b) ECAPA-TDNN ⊕ MFB, and (c) GeMAPS ⊕ MFB. Each plot compares adaptation using the complete CH1 all set (2–7 speakers) versus using only conversations with 2–4 speakers. Beside each metric, arrows indicate whether lower (↓) or higher (↑) values are better. Values in parentheses show the metric scores for (CH1 all, CH1-2–4spk) respectively. Numerical ranges shown in brackets [min–max] for each axis are individually adjusted per metric to highlight relative differences between adaptation strategies. Metrics displayed include DER, SCE, Speaker Count Accuracy, Turn Detection F1, and OSD F1.

Figure 7. Impact of dropout rate on DER performance across different feature configurations. Results compare adaptation using complete CH1 all (2–7 speakers) versus restricted set (2–4 speakers). Each subplot shows a different feature configuration: (left) MFB baseline features, (middle) ECAPA-TDNN ⊕ MFB features, and (right) GeMAPS ⊕ MFB features.

Figure 8. Impact of weighted batch sampling on speaker counting performance across different feature configurations and adaptation strategies. Results compare standard sampling versus weighted sampling for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Top: Speaker Count Accuracy (%). Bottom: Speaker Count Error.

Figure 9. Impact of label smoothing and subsampling rate applied during adaptation on system performance across different feature configurations. Top row shows DER (%) variations; bottom row shows Speaker Turn Detection F1-score changes. Each column represents a different feature configuration.
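For readers unfamiliar with label smoothing in a frame-level, multi-label setting, the following sketch shows one standard formulation applied to binary speaker-activity targets. The smoothing factor of 0.1, the symmetric pull toward 0.5, and the function name are assumptions; this section does not specify the exact variant or value used in the experiments.

```python
def smooth_binary_labels(labels, epsilon):
    """Soften hard 0/1 targets: y' = y * (1 - epsilon) + 0.5 * epsilon."""
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    return [[y * (1.0 - epsilon) + 0.5 * epsilon for y in frame] for frame in labels]


# Toy targets: 3 frames, 2 speakers (1 = speaker active in that frame).
labels = [[1, 0], [1, 1], [0, 0]]

EPSILON = 0.1  # assumed value, chosen only for illustration
smoothed = smooth_binary_labels(labels, EPSILON)
print(smoothed)
```

With this formulation active speakers receive a target slightly below 1 and inactive speakers slightly above 0, which discourages overconfident frame-level posteriors.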

Figure 10. Performance gains from enhanced temporal resolution across feature configurations, with consistent regularization (dropout = 0.3) and adaptation strategy (CH1 2–4 spk). Left: Diarization Error Rate (DER), where lower values indicate better performance. Right: Speaker Turn Detection F1-score, where higher values indicate better performance. Green percentages indicate relative improvements.

Table 1. Distribution of audio recordings by number of speakers in the CallHome database for training (CH1) and testing (CH2) datasets. The columns represent the number of speakers per recording, while values indicate the total count of audio files in each category.

Dataset	Number of Speakers						Total
Dataset	2	3	4	5	6	7	Total
CH1 (Train)	155	61	23	5	3	2	249
CH2 (Test)	148	74	20	5	3	0	250
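To show what restricting adaptation to similar speaker counts means in terms of data, the sketch below filters the CH1 recording counts from Table 1 down to the 2–4 speaker subset. The dictionary layout and function name are hypothetical; only the counts are taken from the table.

```python
# Number of CH1 recordings per speaker count, copied from Table 1.
CH1_COUNTS = {2: 155, 3: 61, 4: 23, 5: 5, 6: 3, 7: 2}


def restrict_by_speaker_count(counts, min_spk, max_spk):
    """Keep only the entries whose speaker count lies in [min_spk, max_spk]."""
    return {n: c for n, c in counts.items() if min_spk <= n <= max_spk}


restricted = restrict_by_speaker_count(CH1_COUNTS, 2, 4)

total_all = sum(CH1_COUNTS.values())         # all CH1 recordings
total_restricted = sum(restricted.values())  # CH1 2-4 spk subset
share = 100.0 * total_restricted / total_all

print(total_all, total_restricted, f"{share:.1f}%")
```

The restricted set drops only the ten recordings with five or more speakers, so the 2–4 speaker subset still covers the large majority of CH1 files.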

Table 2. Diarization Error Rate (DER) components for best feature configurations. For ECAPA-TDNN concatenated with MFB features, oracle VAD was employed during both training and inference phases. Models are adapted to CH1-2spk and evaluated on CH2-2spk.

Feature Type	Conditions	Adaptation	DER Components (%)
Feature Type	Conditions	Adaptation	Miss	FA	Spk. Error	Total
MFB (baseline)	No VAD	✗	2.50	6.60	0.99	10.09
MFB (baseline)	No VAD	✓	3.40	3.87	0.80	8.07
ECAPA-TDNN ⊕ MFB	Oracle VAD	✗	3.55	3.10	3.21	9.86
ECAPA-TDNN ⊕ MFB	Oracle VAD	✓	3.00	3.05	1.15	7.20
GeMAPS ⊕ MFB	No VAD	✗	3.50	5.70	1.55	10.75
GeMAPS ⊕ MFB	No VAD	✓	5.74	2.40	0.90	9.04
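The Total column in Table 2 is the sum of the three DER components (missed speech, false alarm, and speaker error). The following sketch recomputes that sum for each row as a consistency check; the tuple layout and variable names are illustrative, and the values are copied from the table.

```python
# (feature type, adapted?, miss, false alarm, speaker error, reported total)
TABLE_2 = [
    ("MFB (baseline)", False, 2.50, 6.60, 0.99, 10.09),
    ("MFB (baseline)", True, 3.40, 3.87, 0.80, 8.07),
    ("ECAPA-TDNN + MFB", False, 3.55, 3.10, 3.21, 9.86),
    ("ECAPA-TDNN + MFB", True, 3.00, 3.05, 1.15, 7.20),
    ("GeMAPS + MFB", False, 3.50, 5.70, 1.55, 10.75),
    ("GeMAPS + MFB", True, 5.74, 2.40, 0.90, 9.04),
]


def der_total(miss, false_alarm, speaker_error):
    """DER (%) as the sum of its three error components."""
    return miss + false_alarm + speaker_error


# Collect rows whose reported total deviates by more than rounding error.
mismatches = [
    (name, adapted)
    for name, adapted, miss, fa, spk, total in TABLE_2
    if abs(der_total(miss, fa, spk) - total) > 0.005
]
print("rows with inconsistent totals:", mismatches)
```

The same check can be applied to the appendix tables, where components are reported to one decimal place and totals may therefore differ by up to about 0.1 from the component sum.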

Table 3. Performance comparison of different GeMAPS feature configurations on CH2-2spk. The Proposed Reduced Set excludes voiced segments and jitter/shimmer features based on the ablation study results.

Configuration	Number of Features	DER (%)	Δ DER (%)
Complete GeMAPS	62	13.60	-
F0 + Formants	24	45.33	+31.73
Proposed Reduced Set	52	13.93	+0.33

Table 4. Performance comparison between the original baseline [36] and our implementation on CH2 evaluation set. Results are shown for different training configurations: two-speaker simulated conversations (SC 2 spkr), one-to-four speaker simulated conversations (SC 1–4 spkr), and after adaptation using the complete CH1 set (CH1 all). DER (%) is reported for the complete evaluation set (All) and broken down by the number of speakers in each recording.

	Training Set	All	2 spk	3 spk	4 spk	5 spk	6 spk
Baseline	SC 2 spkr	20.86	8.48	21.07	29.56	45.61	49.20
	SC 1–4 spkr	16.18	8.95	13.78	21.22	37.35	46.32
	CH1 all	16.07	10.03	14.35	19.30	30.67	46.94
Baseline (ours)	SC 2 spkr	22.24	10.33	22.71	30.64	49.18	48.65
	SC 1–4 spkr	18.28	9.91	16.47	24.29	45.88	46.35
	CH1 all	16.80	7.99	15.42	23.47	37.17	41.07

Table 5. Performance metrics for our baseline implementation after CH1 adaptation on the CH2 evaluation set. Results demonstrate the system’s capabilities across different aspects of the diarization task.

Task	Metric	Performance
Speaker Count	Accuracy (%)	73.20
Speaker Count	Speaker Counting Error	0.348
Speaker Turn Detection	Precision (%)	62.41
	Recall (%)	46.31
	F1-score (%)	53.16
Overlap Speech Detection	Precision (%)	64.90
	Recall (%)	42.40
	F1-score (%)	51.30
	Accuracy (%)	87.95
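
The F1-scores in Table 5 follow the standard definition of F1 as the harmonic mean of precision and recall. A minimal Python sketch, assuming that definition and using the rounded precision and recall values from the table, gives results that agree with the reported 53.16 and 51.30 to within about 0.01 (the small gap is consistent with the table values having been computed from unrounded inputs). The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs (%) copied from Table 5.
turn_f1 = f1_score(62.41, 46.31)     # speaker turn detection
overlap_f1 = f1_score(64.90, 42.40)  # overlap speech detection

print(f"turn detection F1: {turn_f1:.2f}")
print(f"overlap speech F1: {overlap_f1:.2f}")
```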

Table 6. DER (%) performance comparison on the CH2 evaluation set across different feature configurations and training sets. Results are broken down by the number of speakers in each recording.

	Training Set	All	2 spk	3 spk	4 spk	5 spk	6 spk
Baseline (ours)	SC 2 spkr	22.24	10.33	22.71	30.64	49.18	48.65
	SC 1–4 spkr	18.28	9.91	16.47	24.29	45.88	46.35
	CH1 all	16.8	7.99	15.42	23.47	37.17	41.07
ECAPA-TDNN ⊕ MFB	SC 2 spkr	26.66	12.31	26.12	38.41	44.22	57.37
	SC 1–4 spkr	27.41	14.6	24.24	41.09	52.91	54.82
	CH1 all	20.95	10.45	19.00	31.79	38.44	51.32
GeMAPS ⊕ MFB	SC 2 spkr	24.90	10.75	27.45	36.39	48.50	54.02
	SC 1–4 spkr	22.91	11.01	23.99	30.56	45.23	49.15
	CH1 all	23.21	14.34	22.49	29.38	42.27	49.44

Table 7. Comprehensive performance metrics comparison across different feature configurations on the CH2 evaluation set after CH1 adaptation. Results show that while feature combinations improve certain aspects of performance, they introduce different trade-offs in speaker discrimination capabilities.

Task	Metric	MFB	ECAPA-TDNN ⊕ MFB	GeMAPS ⊕ MFB
Speaker Count	Accuracy (%)	73.20	64.00	66.40
Speaker Count	SCE	0.348	0.444	0.428
Speaker Turn Det.	Precision (%)	62.41	56.20	54.58
	Recall (%)	46.31	37.97	43.17
	F1-score (%)	53.16	45.32	48.21
Overlap Speech Det.	Precision (%)	64.90	42.50	55.80
	Recall (%)	42.40	45.60	36.30
	F1-score (%)	51.30	44.00	44.00
	Accuracy (%)	87.95	82.64	86.17

Table 8. Performance comparison of different dropout rates across multiple metrics for MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

Adaptation	Dropout	Speaker Count		Speaker Turn Detection			Overlap Speech Detection
Adaptation	Dropout	SCE	Acc.	P	R	F1	P	R	F1	Acc.
CH1 all	0.1	0.348	73.6	62.41	46.31	53.16	64.9	42.4	51.3	87.95
	0.3	0.332	73.2	61.94	48.07	54.13	65.6	42.0	51.2	88.03
	0.5	0.436	66.0	60.20	45.06	51.57	69.4	19.4	30.4	86.67
CH1 2–4 spk	0.1	0.320	74.4	62.63	47.51	53.97	64.4	44.5	52.6	88.02
	0.3	0.312	74.8	62.68	49.41	55.26	66.5	42.0	51.5	88.16
	0.5	0.364	68.0	59.36	47.70	52.85	61.8	25.9	36.5	86.52

Table 9. Performance comparison of different dropout rates across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

Adaptation	Dropout	Speaker Count		Speaker Turn Detection			Overlap Speech Detection
Adaptation	Dropout	SCE	Acc.	P	R	F1	P	R	F1	Acc.
CH1 all	0.1	0.444	64.0	56.20	37.97	45.32	42.5	45.6	44.0	82.64
	0.3	0.384	70.0	58.86	38.69	46.69	43.3	45.1	44.2	82.96
	0.5	0.592	57.2	55.33	36.10	43.69	41.2	44.3	42.7	81.32
CH1 2–4 spk	0.1	0.376	68.4	59.84	40.54	48.33	45.2	47.0	46.1	83.55
	0.3	0.384	68.8	60.41	41.06	48.89	50.5	41.3	45.4	85.16
	0.5	0.464	62.8	57.06	38.17	45.74	44.3	37.4	40.6	83.61

Table 10. Performance comparison of different dropout rates across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. P indicates Precision and R indicates Recall.

Adaptation	Dropout	Speaker Count		Speaker Turn Detection			Overlap Speech Detection
Adaptation	Dropout	SCE	Acc.	P	R	F1	P	R	F1	Acc.
CH1 all	0.1	0.428	66.4	54.58	43.17	48.21	55.8	36.3	44.0	86.17
	0.3	0.524	59.2	56.77	43.51	49.26	38.4	33.6	35.9	82.02
	0.5	0.636	50.0	51.58	38.21	43.90	53.0	15.1	23.4	85.30
CH1 2–4 spk	0.1	0.415	67.3	53.40	42.12	47.09	56.2	37.9	45.3	86.29
	0.3	0.328	73.2	60.63	47.08	53.00	66.2	29.9	41.2	87.22
	0.5	0.396	65.2	55.74	45.97	50.39	61.0	21.6	31.9	86.21

Table 11. Performance comparison of weighted batch sampling across multiple metrics for MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

Adaptation	Strategy	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Adaptation	Strategy	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
CH1 all	No Weight	9.0	2.6	5.2	16.8	62.41	46.31	53.16	64.9	42.4	51.3
CH1 all	Weighted	7.7	3.4	5.8	16.93	57.85	45.30	50.81	62.0	42.7	50.6
CH1 2–4 spk	No Weight	8.0	2.7	4.9	15.64	62.63	47.51	53.97	64.4	44.5	52.6
CH1 2–4 spk	Weighted	11.1	2.2	5.8	19.08	62.24	46.98	53.51	65.1	31.4	42.3

Table 12. Performance comparison of weighted batch sampling across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

Adaptation	Strategy	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Adaptation	Strategy	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
CH1 all	No Weight	5.0	8.4	7.6	20.95	56.20	37.97	45.32	42.5	45.6	44.0
CH1 all	Weighted	8.2	7.7	8.0	23.86	55.45	36.88	44.30	40.5	44.6	42.5
CH1 2–4 spk	No Weight	5.3	6.3	5.5	17.09	59.84	40.54	48.33	45.2	47.0	46.1
CH1 2–4 spk	Weighted	5.3	6.2	5.3	16.87	59.91	40.56	48.37	45.5	47.5	46.5

Table 13. Performance comparison of weighted batch sampling across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for both complete (CH1 all) and restricted (CH1 2–4 spk) adaptation sets. Speaker counting metrics are shown in Figure 8. P indicates Precision and R indicates Recall.

Adaptation	Strategy	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Adaptation	Strategy	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
CH1 all	No Weight	10.0	5.2	8.0	23.21	54.58	43.17	48.21	55.8	36.3	44.0
CH1 all	Weighted	17.6	3.0	7.8	28.35	54.78	40.57	46.62	55.3	26.1	35.5
CH1 2–4 spk	No Weight	9.6	4.8	8.1	22.50	53.40	42.12	47.09	56.2	37.9	45.3
CH1 2–4 spk	Weighted	12.3	3.9	5.8	22.09	59.95	45.34	51.63	55.2	33.9	42.0

Table 14. Performance comparison of label smoothing and subsampling rate across multiple metrics for MFB features. Results are shown for restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

Strategy	Subsampling	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Strategy	Subsampling	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
No Smooth	10	8.0	2.7	4.9	15.64	62.63	47.51	53.97	64.4	44.5	52.6
No Smooth	5	8.3	2.6	5.1	16.0	63.5	48.0	54.7	64.8	45.0	53.0
Smooth	10	7.8	4.2	5.4	17.39	60.60	44.70	51.44	56.0	47.0	51.4
Smooth	5	8.0	2.8	5.7	16.43	62.49	46.97	53.64	63.1	44.9	52.5

Table 15. Performance comparison of label smoothing and subsampling rate across multiple metrics for ECAPA-TDNN ⊕ MFB features. Results are shown for restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

Strategy	Subsampling	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Strategy	Subsampling	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
No Smooth	10	5.3	6.3	5.5	17.09	59.84	40.54	48.33	45.2	47.0	46.1
No Smooth	5	5.2	6.2	5.4	16.8	60.5	41.0	49.0	45.5	47.5	46.5
Smooth	10	6.9	7.4	5.7	19.96	58.69	38.19	46.27	42.3	43.4	42.8
Smooth	5	5.1	6.0	5.3	16.4	61.0	41.5	49.5	46.0	48.0	47.0

Table 16. Performance comparison of label smoothing and subsampling rate across multiple metrics for GeMAPS ⊕ MFB features. Results are shown for restricted adaptation set (CH1 2–4 spk). P indicates Precision and R indicates Recall.

Strategy	Subsampling	DER Components (%)				Speaker Turn Detection			Overlap Speech Detection
Strategy	Subsampling	Miss	FA	Spk. Error	Total	P	R	F1	P	R	F1
No Smooth	10	9.6	4.8	8.1	22.5	53.40	42.12	47.09	56.2	37.9	45.3
No Smooth	5	11.8	3.0	6.0	20.8	60.5	45.8	52.2	60.5	31.0	41.0
Smooth	10	11.4	3.3	5.8	20.52	59.93	45.01	51.41	59.9	33.5	42.9
Smooth	5	11.6	2.8	5.9	20.3	60.94	46.35	52.65	61.0	31.5	41.5

Table 17. Error component analysis for ECAPA-TDNN ⊕ MFB and GeMAPS ⊕ MFB with enhanced temporal resolution (subsampling = 5). All configurations use dropout = 0.3 and restricted adaptation (CH1 2–4 spk).

Feature Configuration	Miss (%)	FA (%)	Speaker Error (%)	Total DER (%)
MFB	8.9	2.1	4.7	15.76
ECAPA-TDNN ⊕ MFB	5.7	3.7	5.5	14.89
GeMAPS ⊕ MFB	12.3	1.6	5.0	18.87

Table 18. Performance comparison of optimal configurations in terms of DER (%) for each feature type. All configurations use restricted adaptation (CH1 2–4 spk), dropout = 0.3, and subsampling = 5.

Feature Configuration	Miss (%)	FA (%)	Spk. Error (%)	DER (%)	Speaker Count Acc. (%)	Turn Det. F1 (%)	OSD F1 (%)
MFB	8.0	2.7	4.9	15.64	74.4	53.97	52.6
ECAPA-TDNN ⊕ MFB	5.7	3.7	5.5	14.89	68.8	50.53	47.0
GeMAPS ⊕ MFB	12.3	1.6	5.0	18.87	73.2	52.65	41.5

Table 19. Performance comparison across adaptation enhancements for each feature configuration. Results show DER (%) and relative improvement compared to the standard configuration (CH1 all).

Configuration	MFB	ECAPA-TDNN ⊕ MFB	GeMAPS ⊕ MFB
Standard (CH1 all)	16.80	20.95	23.21
Restricted adaptation (CH1 2–4 spk)	15.64 (−6.9%)	17.09 (−18.4%)	22.50 (−3.1%)
Restricted adapt. + dropout (0.3)	16.17 (−3.8%)	15.80 (−24.6%)	19.74 (−14.9%)
All optimizations (subsampling = 5)	15.76 (−6.2%)	14.89 (−29.0%)	18.87 (−18.7%)
Full enhancement stack *	16.16 (−3.8%)	15.00 (−28.4%)	19.48 (−16.1%)

* Includes restricted adaptation, dropout = 0.3, subsampling = 5, label smoothing and weighted batch sampling.
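
The percentages in parentheses in Table 19 are relative changes in DER with respect to the standard configuration (CH1 all) of the same feature type. A minimal Python sketch of that calculation, using the restricted-adaptation row as an example, is given below; the helper name `relative_change` and the dictionary layout are illustrative assumptions, not part of the authors' code.

```python
def relative_change(reference: float, value: float) -> float:
    """Relative change of `value` with respect to `reference`, in percent."""
    return 100.0 * (value - reference) / reference


# DER (%) values copied from Table 19: standard vs. restricted adaptation.
standard = {"MFB": 16.80, "ECAPA-TDNN ⊕ MFB": 20.95, "GeMAPS ⊕ MFB": 23.21}
restricted = {"MFB": 15.64, "ECAPA-TDNN ⊕ MFB": 17.09, "GeMAPS ⊕ MFB": 22.50}

for name, ref in standard.items():
    print(f"{name}: {relative_change(ref, restricted[name]):+.1f}%")
```

A negative value indicates a DER reduction relative to the standard configuration.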

Table 20. Performance comparison between our optimized systems and state-of-the-art approaches on the CH2 evaluation set. Results are shown for both collared (250 ms) and collar-free evaluation, applying different subsampling factors during inference for our proposed systems.

System	Collar	DER (%)
System	Collar	Subsampling = 10, Median = 11	Subsampling = 5, Median = 5
MFB	✓	15.72	16.76
MFB	✗	26.06	23.95
ECAPA-TDNN ⊕ MFB	✓	14.89	15.59
ECAPA-TDNN ⊕ MFB	✗	25.78	22.69
GeMAPS ⊕ MFB	✓	18.87	23.25
GeMAPS ⊕ MFB	✗	29.19	26.94
VBx [9]	✓	14.21
VBx [9]	✗	21.77
AHC [9]	✓	17.64
AHC [9]	✗	25.61
Pyannote [44]	✓	-
Pyannote [44]	✗	29.30

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
