Article

The Speaker Identification Model for Air-Ground Communication Based on a Parallel Branch Architecture

College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan 618307, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 2994; https://doi.org/10.3390/app15062994
Submission received: 22 January 2025 / Revised: 3 March 2025 / Accepted: 5 March 2025 / Published: 10 March 2025

Abstract

This study addresses the challenges of complex noise and short speech in civil aviation air-ground communication scenarios and proposes a novel speaker identification model, Chrono-ECAPA-TDNN (CET). The aim of the study is to enhance the accuracy and robustness of speaker identification in these environments. The CET model incorporates three key components: the Chrono Block module, the speaker embedding extraction module, and the optimized loss function module. The Chrono Block module utilizes parallel branching architecture, Bi-LSTM, and multi-head attention mechanisms to effectively extract both global and local features, addressing the challenge of short speech. The speaker embedding extraction module aggregates features from the Chrono Block and employs self-attention statistical pooling to generate robust speaker embeddings. The loss function module introduces the Sub-center AAM-Softmax loss, which improves feature compactness and class separation. To further improve robustness, data augmentation techniques such as speed perturbation, spectral masking, and random noise suppression are applied. Pretrained on the VoxCeleb2 dataset and tested on the air-ground communication dataset, the CET model achieves 9.81% EER and 88.62% accuracy, outperforming the baseline ECAPA-TDNN model by 1.53% in EER and 2.19% in accuracy. The model also demonstrates strong performance on four cross-domain datasets, highlighting its broad potential for real-time applications.

1. Introduction

With the rapid development of the global aviation industry and the continuous increase in flight traffic, the field of air traffic management faces severe challenges, particularly in terms of flight safety and efficient scheduling. Traditional air traffic control methods that rely on human experience are insufficient to meet the growing demands of air transport. In recent years, the rise of artificial intelligence technologies has provided new solutions for the automation and intelligence of air traffic control systems, with the intelligent processing of air-ground communication becoming a key direction for addressing the “human-machine” interaction bottleneck.
Air-ground communication refers to the voice exchange between air traffic controllers and pilots via semi-duplex very high frequency (VHF) radios. This form of communication involves multiple participants and rounds of dialogue, posing high demands on the accuracy and robustness of speech recognition. Currently, the analysis of air-ground communication primarily relies on manual efforts, which not only require professionals to invest a significant amount of time and energy but also face growing inefficiency and unreliability as the volume of communication increases. In this context, speaker recognition technology can effectively assist in the automation of air-ground communication analysis. By accurately distinguishing between the identities of controllers and pilots involved in the conversation, it can not only improve the efficiency of communication content parsing but also provide technical support for identifying potential communication errors, thus reducing the incidence of safety accidents and ensuring aviation safety.
Speaker recognition (SR) technology is generally divided into two categories based on the application purpose: speaker identification [1] and speaker verification [1]. The task of speaker identification is to recognize the speaker of a particular segment of speech from multiple speakers, mainly used in speaker recognition scenarios involving multiple participants. On the other hand, speaker verification is the task of determining whether a segment of speech comes from a specific target speaker, commonly found in applications such as voiceprint locks and voiceprint authentication [2] for security purposes. Additionally, based on the working method of speaker recognition, it can be further classified into text-dependent [3] and text-independent [3] types. Text-dependent speaker recognition requires the speaker to record speech based on a specific text. Although this method has high recognition accuracy, its flexibility is limited, making it suitable only for scenarios with known text. In contrast, text-independent speaker recognition is unrestricted and can recognize speech from any arbitrary segment, making it highly flexible and convenient for widespread use in industrial applications. Given the diversity and uncertainty of speech in air-ground communication, as well as the complexity of situations involving multiple speakers in the channel, this study focuses on text-independent speaker identification technology to meet practical application needs.
Currently, researchers have extensively studied deep learning-based speaker identification technology to replace traditional statistical models, aiming to improve recognition accuracy and robustness. Snyder et al. [4] proposed the x-vector model, which uses multi-layer Time Delay Neural Networks (TDNNs) and a statistical pooling layer to transform frame-level features into sentence-level feature representations. These representations are then passed through a fully connected layer to obtain the speaker embedding, which effectively captures the speaker’s identity features and is used to distinguish different speakers in subsequent recognition tasks. In 2020, Desplanques et al. [5] enhanced the TDNN-based x-vector architecture by proposing the Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) model. This model incorporates improvements such as the Squeeze-and-Excitation Networks (SE-Net) module, channel attention mechanisms, and multi-layer feature fusion, significantly boosting the accuracy of speaker recognition. The ECAPA-TDNN model has since become one of the most advanced frameworks in the field of speaker recognition. However, these models perform sub-optimally in air-ground communication scenarios, primarily due to the distinct characteristics of air-ground communication, which differ significantly from general speech, presenting the following challenges for speaker recognition in air-ground communication:
  • Air-ground communication is affected by complex background noise inside the cabin [6] and radio transmission interference, with noise exhibiting non-stationary characteristics and a low signal-to-noise ratio. This complex noise imposes higher demands on the robustness of the ECAPA-TDNN model, making it difficult for the model to extract stable speaker features.
  • The speech segments in air-ground communication are mostly shorter than 8 s, which is typically considered short speech. Because short speech contains limited speech and speaker information, the model's feature extraction capability must be stronger, as it needs to extract effective speaker features from the limited audio information. However, the ECAPA-TDNN model mainly focuses on modeling local spatial features and lacks effective integration of global context information, making it unable to fully handle long-range dependencies. As a result, this model does not perform optimally in air-ground communication scenarios.
To improve the performance of speaker identification systems in air-ground communication scenarios and overcome the limitations of the ECAPA-TDNN model proposed by Desplanques et al. [5] in such environments, this paper presents an easy-to-implement and effective speaker identification network model—Chrono-ECAPA-TDNN (CET). Using ECAPA-TDNN as the baseline network, three improvements are made to address the challenges of complex noise and short speech in air-ground communication.
The main contributions of this paper can be summarized as follows:
(1)
Addressing the challenge of complex noise: data augmentation techniques, including speed perturbation, spectral masking, and random noise suppression, are applied to the audio to enhance the model’s robustness in complex noise environments.
(2)
Addressing the challenge of short speech: a parallel branch architecture is introduced in the air-ground communication speaker identification model (Chrono-ECAPA-TDNN). Using raw air-ground communication speech as input, a bidirectional long short-term memory (Bi-LSTM) network is added to the existing residual neural backbone of the ECAPA-TDNN model to capture speaker information at both global and local levels. A multi-head attention mechanism is also incorporated for feature enhancement.
(3)
Improving model accuracy: to improve the accuracy of the air-ground communication speaker identification model, the Sub-center AAM-Softmax (Angular Additive Margin Softmax) is used instead of the AM-Softmax (additive margin Softmax). This modification tightens features within the same class and increases the separation between classes in the feature space, thus improving the model’s classification and discrimination ability, especially for categories with similar features.
In this paper, the Chrono-ECAPA-TDNN model was pre-trained on the VoxCeleb2 [7] dataset and then tested on the air-ground communication dataset, as well as the TIMIT [8] dataset, LibriSpeech [9], VoxCeleb1 [10], and CN-Celeb [11]. Performance comparisons were made with baseline networks including ECAPA-TDNN, Residual Network 34 (ResNet34), and Multi-scale Feature Aggregation Conformer (MFA-Conformer). The test results demonstrate that Chrono-ECAPA-TDNN outperforms existing models both in air-ground communication scenarios and on cross-domain open-source datasets, with a significant reduction in Equal Error Rate (EER) and a notable improvement in accuracy.

2. Related Works

In recent years, with the rapid development of neural networks, neural models have been widely used in speaker identification tasks. In this context, various speaker identification methods based on Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and their variants have been proposed [12,13,14,15,16,17].
Variani et al. [18] proposed a frame-level acoustic feature model based on DNNs, where the average value of specific features from the last layer of a trained DNN is used as the speaker model, known as the d-vector. Compared to the traditional i-vector [19] method, the d-vector performs better in small-scale, text-dependent speaker identification tasks. To overcome the limitation of d-vector, which only extracts frame-level features, Snyder et al. [4] proposed the x-vector model. This model uses a multi-layer Time Delay Neural Network (TDNN) and a statistical pooling layer to transform frame-level features into sentence-level feature representations, which are then converted into speaker embeddings via a fully connected layer.
In 2017, Vydana et al. [20] proposed a model based on Residual Networks (ResNets), which introduces residual connections into deep network models to extract more discriminative voiceprint features. However, the feature extraction method of ResNets is relatively simple. Building on residual networks, Chung et al. [21] proposed the ResNetSE34L and ResNetSE34V2 models, which employ convolution kernels of different scales to enhance the representation of multi-scale features.
In 2020, Desplanques et al. [5] further enhanced the TDNN-based x-vector architecture by proposing the ECAPA-TDNN model. In 2023, Liu and Qian [22] proposed ECAPA++, an enhanced ECAPA-TDNN model incorporating increased depth, recursive convolution, and a pyramid-based multi-path feature enhancement module. ECAPA++ achieved up to 25% relative improvement on VoxCeleb benchmarks while significantly reducing computational complexity, demonstrating efficiency comparable to state-of-the-art ResNet-based systems. In 2024, Zhang et al. [23] proposed SC-Ecapa Tdnn, an enhanced ECAPA-TDNN model integrating depth-separable convolutions and adaptive 1D convolution-based channel attention. This architecture improves speaker feature extraction while balancing computational efficiency. Experiments on AISHELL and CN-Celeb datasets demonstrate its superior performance over mainstream speaker recognition systems.
Although significant progress has been made in speaker identification technology based on deep learning in recent years, with good performance in general scenarios, these models still face multiple challenges, such as complex noise interference and limited short speech information.
In response to complex noise environments, Zouhir Y. [24] proposed a bionic cepstral coefficient (BCC) method in 2024, which combines a bionic wavelet filter bank (BWFB) with the Equivalent Rectangular Bandwidth (ERB) rate scale for feature extraction, significantly improving the performance of speaker identification models in noisy environments. In 2024, Chauhan [25] and others adopted feature optimization strategies, including genetic algorithms and marine predator algorithms, combined with feature fusion and dimensionality reduction techniques. These methods significantly improved speaker identification accuracy and robustness on various noisy datasets such as TIMIT and VoxCeleb1.
However, these approaches are computationally expensive and require substantial parameter tuning. In contrast, data augmentation methods are widely used because of their simplicity and their ability to mitigate the impact of complex noise on the model.
In 2023, Salazar et al. [26] introduced a theoretical learning curve for the multi-class Bayes classifier, demonstrating that data augmentation increases the sample size, which in turn reduces classification error by providing more diverse training samples. This framework highlights the importance of data augmentation for improving model performance, particularly when dealing with limited or imbalanced datasets. Chen [27] and others effectively removed noise components from pre-recorded or live recordings using a pre-trained speech enhancement model, thereby improving the model’s robustness. Salazar et al. [28] proposed the GANSO data augmentation method, combining GANs and Markov Random Fields to generate synthetic data, enhancing classifier performance in data-scarce scenarios. In 2024, Nisa [29] and colleagues proposed a noise suppression-based speech enhancement method, which effectively reduced the negative impact of degraded speech input quality on speaker recognition technology, providing a simple and effective solution for achieving efficient speaker identification in noisy environments.
In response to the limited speech and speaker information available in short speech, in 2023, Zhao Z. [30] and others proposed an improved ECAPA-TDNN model by introducing a progressive channel fusion strategy and increasing the depth and branching structure of the model, which enhanced its ability to generate deep feature representations and significantly improved performance.
Subsequently, Yao J. [31] and others built on this by introducing a parallel branch architecture and multi-head self-attention mechanism. One branch employed the multi-head self-attention mechanism to capture long-range dependencies of speech features, while the other branch utilized the SE-Res2Block (SENet, Res2Net) module to model local multi-scale features. Different feature fusion methods were adopted to effectively integrate the feature information from both branches.
In 2024, Wang et al. [32] introduced the CIFG (Convolutional Long Short-Term Memory (LSTM) with input and forget gates) module into the ECAPA-TDNN architecture to effectively model temporal relationships, thereby enhancing the model’s ability to integrate global information. Additionally, Wang applied an improved Sub-center Arcface loss function, which enhances intra-class compactness and the network’s robustness by selecting sub-centers for subclass differentiation.
Overall, with the continuous advancement of deep learning technologies, an increasing number of studies have sought to address the performance bottlenecks in speaker recognition tasks under complex noise environments and short speech conditions. These studies typically aim to improve computational efficiency while maintaining performance through structural innovations, loss function optimization, and lightweight strategies. In this paper, the proposed Chrono-ECAPA-TDNN model incorporates these strategies to specifically address the challenges of noise interference and short speech in air-ground communication scenarios.

3. Materials and Methods

3.1. Chrono-ECAPA-TDNN Network Architecture

The architecture of the Chrono-ECAPA-TDNN network is shown in Figure 1. The input data consist of the raw digital signals of air-ground communication audio, which are then processed by a feature extractor to generate Filter Bank (Fbank) features. First, the input features are processed through a 1D convolutional layer (Conv1D + ReLU + BN) to capture temporal dependency features in the speech signal. Conv1D refers to a one-dimensional convolutional operation, ReLU denotes the Rectified Linear Unit activation function and BN refers to Batch Normalization. Next, the features are passed through three layers of the Chrono Block module. The Chrono Block module consists of multiple parallel branches designed to capture both global and local features, further extracting the temporal information of the speech. The output of the three layers of the Chrono Block is then fed into a 1D convolutional layer. After multi-layer feature aggregation, the features are further extracted. Subsequently, the features are aggregated through an attentive stat pooling layer, which aggregates all temporal information and generates a fixed-length global feature vector. Finally, after being processed through a fully connected layer and Batch Normalization (FC + BN), a fixed-length speaker embedding is generated, and the inter-class distance and intra-class divergence are computed using the Sub-Center Loss function, outputting the predicted speaker class. The specific computation process is as follows:
The raw digital signals of air-ground communication audio are processed by a feature extractor to generate 80-dimensional Fbank features. The shape of the input features is $X \in \mathbb{R}^{B \times D \times T}$, where $B$ represents the batch size, $D$ is the feature dimension, and $T$ is the number of frames. The input features are processed through a 1D convolutional layer (Conv1D + ReLU + BN) to extract temporal dependency features, as shown in Equation (1). The output feature shape is $X_{\mathrm{Conv1}} \in \mathbb{R}^{B \times C \times T}$.
$X_{\mathrm{Conv1}} = \mathrm{BN}(\mathrm{ReLU}(\mathrm{Conv1D}(X))),$  (1)
where $C$ represents the number of convolution channels.
The features are then passed through three layers of the Chrono Block module, as shown in Equation (2). The Chrono Block module consists of multiple parallel branches designed to capture both global and local features, further extracting the temporal information of the speech. The output feature shape is $X_{\mathrm{chrono}} \in \mathbb{R}^{B \times C \times T}$.
$X_{\mathrm{chrono}}^{(l)} = \mathrm{ChronoBlock}\big(X_{\mathrm{conv1}}^{(l-1)}\big) \quad \text{for } l = 1, 2, 3,$  (2)
The outputs of the three layers are eventually combined into a single representation, as shown in Equation (3).
$X_{\mathrm{chrono\_final}} \in \mathbb{R}^{B \times 3C \times T},$  (3)
Next, the features are passed through a 1D convolutional layer for multi-layer feature aggregation, and the output feature shape is $X_{\mathrm{agg}} \in \mathbb{R}^{B \times 3C \times T}$, as shown in Equation (4).
$X_{\mathrm{agg}} = \mathrm{BN}(\mathrm{ReLU}(\mathrm{Conv1D}(X_{\mathrm{chrono\_final}}))),$  (4)
Then, the features are passed through the Attentive Stat Pooling layer to aggregate all temporal information, generating a fixed-length global feature vector $X_{\mathrm{pool}} \in \mathbb{R}^{B \times 3C}$, as shown in Equation (5).
$X_{\mathrm{pool}} = \mathrm{AttentiveStatPooling}(X_{\mathrm{agg}}),$  (5)
The features are passed through a fully connected layer and Batch Normalization (FC + BN), generating a fixed-length speaker embedding $X_{\mathrm{embed}} \in \mathbb{R}^{B \times N}$, where $N$ represents the output dimension, as shown in Equation (6).
$X_{\mathrm{embed}} = \mathrm{BN}(\mathrm{FC}(X_{\mathrm{pool}})),$  (6)
Finally, the Sub-Center Loss function is used to compute the inter-class distance and intra-class variance, and the predicted speaker class is output, as shown in Equation (7).
$l = \mathrm{SubCenterLoss}(X_{\mathrm{embed}}, Y),$  (7)
where $Y$ represents the true label and $l$ is the loss function.
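A minimal PyTorch sketch of this forward pass (Equations (1)–(6)) is given below. ChronoBlock and AttentiveStatPooling are placeholders for the modules described in Sections 3.2 and 3.3; the convolution kernel size and the pooled dimension (which assumes the mean and standard deviation are concatenated, as in Section 3.3) are illustrative assumptions, while the channel width (512) and embedding size (192) follow Section 4.2.2.

```python
import torch
import torch.nn as nn

class CETBackboneSketch(nn.Module):
    """Illustrative forward pass for Equations (1)-(6); not the authors' released code."""
    def __init__(self, chrono_block_cls, pooling_cls,
                 feat_dim=80, channels=512, embed_dim=192, pooled_dim=3072):
        super().__init__()
        # Equation (1): Conv1D + ReLU + BN applied to the 80-dimensional Fbank input
        self.conv1 = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.BatchNorm1d(channels),
        )
        # Equation (2): three stacked Chrono Blocks (defined in Section 3.2)
        self.blocks = nn.ModuleList([chrono_block_cls(channels) for _ in range(3)])
        # Equation (4): 1D convolution aggregating the concatenated block outputs
        self.agg = nn.Sequential(
            nn.Conv1d(3 * channels, 3 * channels, kernel_size=1),
            nn.ReLU(),
            nn.BatchNorm1d(3 * channels),
        )
        # Equation (5): attentive statistics pooling (defined in Section 3.3)
        self.pool = pooling_cls(3 * channels)
        # Equation (6): FC + BN producing the fixed-length speaker embedding
        self.fc = nn.Linear(pooled_dim, embed_dim)
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, x):                      # x: (B, 80, T) Fbank features
        x = self.conv1(x)                      # (B, C, T), Equation (1)
        outs = []
        for block in self.blocks:              # Equation (2), l = 1, 2, 3
            x = block(x)
            outs.append(x)
        x = torch.cat(outs, dim=1)             # (B, 3C, T), Equation (3)
        x = self.agg(x)                        # Equation (4)
        x = self.pool(x)                       # Equation (5)
        return self.bn(self.fc(x))             # (B, 192), Equation (6)
```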

3.2. Chrono Block

The Chrono-ECAPA-TDNN framework is an improvement upon the ECAPA-TDNN model. The key difference lies in the use of the Chrono module instead of the single-branch SE-Res2Block module in ECAPA-TDNN to capture both local and global speaker features. It consists of two parallel branches and a merging module, with one branch dedicated to capturing global features through self-attention, while the other branch focuses on extracting local features using the SE-Res2Block module.
To simultaneously capture both local features and global dependency features, the design of the Chrono Block module fully considers the core requirements of speaker representation learning. Local features, such as pitch, intonation style, and pronunciation patterns, provide individualized speech feature information, while global dependency features, such as long-range correlations in variable-length speech, capture the overall contextual relationships of the speech signal. The structure of the Chrono Block module is shown in Figure 2, consisting of two parallel branches and a feature fusion module. The parallel branches include the left branch with a Bi-LSTM module and a multi-head attention module. The Bi-LSTM module captures the global temporal dependencies of the speech signal through forward and backward LSTM layers, further enhancing the model’s ability to model long-range context. In the multi-head attention mechanism, a scaled dot-product attention calculation method commonly used in Transformers is applied, combined with a position encoding strategy, improving the model’s adaptability and robustness to variable-length speech. The right branch consists of a Squeeze-and-Excitation (SE) [5] mechanism and residual network connections, forming the SE-Res2Net module, which focuses on extracting multi-scale local features. Additionally, each layer of the Chrono Block module is equipped with LayerNorm normalization layers to stabilize the training process and enhance the model’s feature representation capability. The feature fusion module integrates the outputs of the two branches through weighted fusion, achieving a unified representation of local and global features and providing a rich information foundation for downstream speaker embedding generation.
The computation process of the Chrono Block module consists of multiple progressive steps, gradually extracting and fusing multi-scale local features and global contextual information from the speech signal. First, the Fbank features are passed through the convolutional layer to obtain a dimensionality-reduced feature representation, which is then sent to the Chrono module for further feature extraction. The Chrono module consists of two parallel branches, each capturing local and global features of the speech signal. For the left branch of the Chrono module, the input features are denoted as $h_{m1}$ and the output features as $h_{m2}$; for the right branch, the input features are $h_{n1}$ and the output features are $h_{n2}$.
The computation of the left branch is detailed as follows, as shown in Equations (8)–(10):
$h_m = h_{m1} + \mathrm{LayerNorm}(h_{m1}),$  (8)
$h_{m1}' = h_m + \mathrm{BiLSTM}(h_m),$  (9)
$h_{m2} = h_{m1}' + \mathrm{MHSA}(h_{m1}'),$  (10)
where $h_{m1}, h_m, h_{m1}', h_{m2} \in \mathbb{R}^{C \times T}$, $m = 1, 2, \ldots, L$ ($L$ is maximum 3), $C$ represents the convolutional channels in the Chrono module, and $T$ represents the number of time frames. LayerNorm refers to layer normalization, BiLSTM refers to the bidirectional Long Short-Term Memory network, and MHSA refers to the Multi-Head Self-Attention mechanism.
The right branch focuses on local feature extraction, with input features $h_{n1}$ and output features $h_{n2}$. The specific computation process can be expressed as follows, as shown in Equations (11)–(13):
$h_n = h_{n1} + \mathrm{LayerNorm}(h_{n1}),$  (11)
$h_{n1}' = h_n + \mathrm{SERes2Net}(h_n),$  (12)
$h_{n2} = h_{n1}' + \mathrm{Conv1D}(h_{n1}'),$  (13)
where $h_{n1}, h_n, h_{n1}', h_{n2} \in \mathbb{R}^{C \times T}$, $n = 1, 2, \ldots, L$ ($L$ is maximum 3), SERes2Net represents the module that extracts local features through multi-scale convolution and a channel attention mechanism, and Conv1D denotes the convolutional layer.
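The two-branch structure of Equations (8)–(13) can be sketched in PyTorch as follows. The SE-Res2Net unit is left as a placeholder (a plain convolution is substituted when none is supplied), and the equal-weight fusion of the two branch outputs is an assumption, since the exact fusion weights are not specified above.

```python
import torch
import torch.nn as nn

class ChronoBlockSketch(nn.Module):
    """Sketch of the two parallel branches in Equations (8)-(13)."""
    def __init__(self, channels, heads=8, se_res2_block=None):
        super().__init__()
        # Left branch: LayerNorm -> Bi-LSTM -> multi-head self-attention
        self.ln_left = nn.LayerNorm(channels)
        self.bilstm = nn.LSTM(channels, channels // 2,
                              batch_first=True, bidirectional=True)
        self.mhsa = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Right branch: LayerNorm -> SE-Res2Net (placeholder) -> Conv1D
        self.ln_right = nn.LayerNorm(channels)
        self.se_res2 = se_res2_block if se_res2_block is not None else \
            nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                               # x: (B, C, T)
        h = x.transpose(1, 2)                           # (B, T, C)
        # Left branch, Equations (8)-(10): residual LayerNorm, Bi-LSTM, MHSA
        hm = h + self.ln_left(h)
        lstm_out, _ = self.bilstm(hm)
        hm1 = hm + lstm_out
        attn_out, _ = self.mhsa(hm1, hm1, hm1)
        hm2 = (hm1 + attn_out).transpose(1, 2)          # back to (B, C, T)
        # Right branch, Equations (11)-(13): residual LayerNorm, SE-Res2Net, Conv1D
        hn = (h + self.ln_right(h)).transpose(1, 2)     # (B, C, T)
        hn1 = hn + self.se_res2(hn)
        hn2 = hn1 + self.conv(hn1)
        # Fusion of the two branches (equal weights assumed here)
        return 0.5 * hm2 + 0.5 * hn2
```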

3.3. Voiceprint Embedding Extraction

To obtain more robust voiceprint embeddings, Chrono-ECAPA-TDNN aggregates the output features of all Chrono modules. Previous studies have shown that shallow features play a crucial role in capturing key information in speech and can significantly improve the representational power of voiceprint embeddings. In ECAPA-TDNN, the output features of all SE-Res2Net modules are aggregated before the Attentive Statistics Pooling (ASP) layer, which effectively enhances model performance. Similarly, in Chrono-ECAPA-TDNN, the output features of each Chrono module are aggregated before the LayerNorm layer (see Figure 2).
First, the output features of all Chrono modules are integrated through feature concatenation to form the following, as shown in Equation (14):
$H = \mathrm{Concat}(h_{m2}, h_{n2}),$  (14)
where $H \in \mathbb{R}^{3C \times T}$, $C$ is the number of convolution channels of the Chrono modules, and $T$ is the number of time frames.
Then, the concatenated features are normalized to obtain the standardized aggregated feature matrix $H'$, as shown in Equation (15).
$H' = \mathrm{LayerNorm}(H),$  (15)
The features are then passed into the Attentive Statistics Pooling layer to obtain the weight coefficients for each frame and extract global features. For the frame-level feature $H_t$ at time frame $t$, the weight coefficient $\alpha_t$ is first computed as follows, as shown in Equations (16) and (17):
$e_t = v^{T} f(W H_t + b) + k,$  (16)
$\alpha_t = \dfrac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)},$  (17)
where $t = 1, 2, \ldots, T$; $W \in \mathbb{R}^{D \times D}$, $b \in \mathbb{R}^{D \times 1}$, and $k \in \mathbb{R}$ (with $D = 3C$) are the learnable parameters of the ASP pooling layer, and $f(\cdot)$ denotes the Tanh activation function.
Using the weight coefficient $\alpha_t$, the weighted mean vector $\mu$ and the weighted standard deviation $\sigma$ are calculated as shown in Equations (18) and (19):
$\mu = \sum_{t=1}^{T} \alpha_t H_t,$  (18)
$\sigma = \sqrt{\sum_{t=1}^{T} \alpha_t (H_t \odot H_t) - \mu \odot \mu},$  (19)
where $\odot$ represents the Hadamard product (element-wise multiplication).
Finally, the output of the ASP pooling layer is obtained by concatenating $\mu$ and $\sigma$, as shown in Equation (20):
$Z = \mathrm{Concat}(\mu, \sigma),$  (20)
The output is then passed through a BatchNorm layer and a Linear layer, further reducing the dimensionality to a fixed-length voiceprint embedding vector $Z \in \mathbb{R}^{N}$ of length $N$, which is used for subsequent classification tasks.
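A compact PyTorch sketch of Equations (16)–(20) is given below. The attention bottleneck dimension (attn_dim) is an implementation assumption made for brevity; the formulation above uses a $D \times D$ projection.

```python
import torch
import torch.nn as nn

class AttentiveStatPoolingSketch(nn.Module):
    """Per-frame attention weights, then weighted mean and standard deviation
    (Equations (16)-(20))."""
    def __init__(self, in_dim, attn_dim=128):
        super().__init__()
        self.W = nn.Linear(in_dim, attn_dim)      # W and b in Equation (16)
        self.v = nn.Linear(attn_dim, 1)           # v and k in Equation (16)

    def forward(self, H):                         # H: (B, D, T)
        H = H.transpose(1, 2)                     # (B, T, D)
        e = self.v(torch.tanh(self.W(H)))         # Equation (16): (B, T, 1)
        alpha = torch.softmax(e, dim=1)           # Equation (17)
        mu = (alpha * H).sum(dim=1)               # Equation (18): weighted mean
        var = (alpha * H * H).sum(dim=1) - mu * mu
        sigma = torch.sqrt(var.clamp(min=1e-8))   # Equation (19): weighted std
        return torch.cat([mu, sigma], dim=1)      # Equation (20): (B, 2D)
```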

3.4. Loss Function

Sub-Center Loss [33] (shown in Figure 3) is a variant of the AAM Softmax loss function, with the primary difference being how the intra-class sample diversity and center points are handled. AAM Softmax defines a single center point for each class, causing all intra-class samples to cluster around that center, which is suitable for cases where intra-class samples are tightly distributed and inter-class differences are significant. However, when intra-class sample variance is large, a single center point is insufficient to effectively represent the class diversity. In contrast, Sub-Center Loss introduces multiple sub-centers for each class, allowing intra-class samples to cluster around different sub-centers, thereby addressing intra-class diversity with finer granularity. This approach performs better, especially on data with complex distributions. Moreover, in the high-noise environment of air-ground communication, the single center in AAM Softmax is more susceptible to outliers, whereas Sub-Center Loss, with its multiple sub-centers, effectively mitigates the impact of noise interference, resulting in a model with greater robustness.
Sub-Center Loss essentially introduces multiple sub-centers for each class to capture intra-class sample diversity with finer granularity. For the feature vector $x_i \in \mathbb{R}^{N \times 1}$ (where $x_i$ is the $i$-th sample, belonging to the $y_i$-th class, and $N$ is the dimension of the speaker embedding), the first step is to normalize it and the sub-center matrix $W$ using the $L_2$ norm. By performing the matrix multiplication $W^{T} x_i$, a similarity score for each sub-center is obtained. Then, a max-pooling operation selects the highest sub-center similarity score $S$ for each class, determining the correlation between the feature vector and that class's sub-centers. Next, the angle $\theta_{i,j}$ between the feature vector and the sub-center is calculated as $\arccos(S)$, and an angular margin $m$ is added to the target-class angle $\theta_{i,y_i}$ to adjust the boundary between classes. Finally, the resulting cosine score is multiplied by a scale factor $s$ to further optimize the distribution over the sub-centers. The cross-entropy loss function is then used to measure the difference between the model's prediction and the true label, minimizing the loss to improve the classification performance of the Chrono-ECAPA-TDNN model. The specific calculation method is shown in Equation (21):
$l = -\log \dfrac{e^{s(\cos(\theta_{i,y_i} + m))}}{e^{s(\cos(\theta_{i,y_i} + m))} + \sum_{j=1, j \neq y_i}^{N} e^{s\cos\theta_{i,j}}},$  (21)
where $\theta_{i,j} = \arccos\big(\max_{k}\big(W_{j_k}^{T} x_i\big)\big)$, $k \in \{1, \ldots, K\}$, and $K$ is the number of sub-centers per class.
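The following PyTorch sketch illustrates the Sub-center AAM-Softmax computation of Equation (21): $K$ sub-centers per class, max-pooling over sub-centers, an additive angular margin applied to the target class, and a scale factor before the cross-entropy loss. The weight initialization and class count are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterAAMSoftmaxSketch(nn.Module):
    """Sketch of Sub-center AAM-Softmax (Equation (21))."""
    def __init__(self, embed_dim, num_classes, k=3, m=0.2, s=30.0):
        super().__init__()
        self.m, self.s = m, s
        # One weight vector per (class, sub-center) pair
        self.W = nn.Parameter(torch.randn(num_classes, k, embed_dim))

    def forward(self, x, labels):                 # x: (B, N), labels: (B,)
        x = F.normalize(x, dim=1)
        W = F.normalize(self.W, dim=2)
        # Cosine similarity to every sub-center, then max over the K sub-centers
        cos = torch.einsum("bn,ckn->bck", x, W).amax(dim=2)     # (B, num_classes)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only on the true class
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)                  # Equation (21)
```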

3.5. Data Augmentation Strategies

To address the challenges posed by the high-noise environment of air-ground communication, this study incorporates three data augmentation techniques—speed perturbation, noise suppression, and spectral masking—aimed at enhancing the accuracy and robustness of speaker identification. These methods are applied only to the training data to improve the model's ability to generalize across diverse and noisy conditions, while the testing data remain unaltered to evaluate the model's performance under more realistic conditions. To further assess the model's generalization capability across different datasets, data augmentation is also employed during the training phase for the publicly available datasets used in the evaluation process, as described later in the study. These datasets, including TIMIT, LibriSpeech, VoxCeleb1, and CN-Celeb, are subjected to the same augmentation techniques during training.
Speed augmentation involves accelerating or slowing down the speech playback speed. By adjusting the speech speed, diverse training samples are generated to help the model adapt to speaker characteristics at different speaking speeds in real-world applications. Speed perturbation is implemented using the ffmpeg toolkit in Python 3.7 to perturb the speed of the speech data, generating multiple versions of the speech signal with speed factors of 0.9, 1.0, and 1.1 to simulate variations in speech features under different speaking speeds.
Considering that real air-ground communication speech already contains significant noise, instead of using common noise addition methods, noise suppression is implemented using a U-Net model. This approach reduces background noise while appropriately attenuating the speech intensity, thereby enhancing the model’s robustness under low speech intensity conditions.
The core idea of spectral masking is to perform data augmentation in the spectral domain. Unlike traditional methods that directly manipulate the audio waveform, spectral masking processes the spectrogram of speech in the feature space. Specifically, the Fbank features extracted from the audio are first normalized and then masking operations are performed in both the time and frequency domains. By randomly masking the time and frequency of the audio features in the training set, more diverse and uncertain training samples are generated. This augmentation method effectively improves the model’s adaptability to noise and variations while reducing overfitting, providing more robust feature representations for air-ground communication speaker identification tasks.
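As an illustration of the spectral-masking step, the sketch below zeroes one random time block and one random frequency block of an (F, T) Fbank feature, using the (0, 10) time and (0, 8) frequency mask-width ranges reported in Section 4.2.2; applying a single mask per axis is an assumption. Speed perturbation and U-Net noise suppression are assumed to be applied to the waveform before feature extraction.

```python
import torch

def spec_mask(fbank, max_t=10, max_f=8):
    """Randomly mask one frequency block and one time block of an (F, T)
    Fbank feature, as in the spectral-masking augmentation described above."""
    n_freq, n_frames = fbank.shape
    f = torch.randint(0, max_f + 1, (1,)).item()          # frequency mask width
    f0 = torch.randint(0, max(n_freq - f, 1), (1,)).item()
    t = torch.randint(0, max_t + 1, (1,)).item()          # time mask width
    t0 = torch.randint(0, max(n_frames - t, 1), (1,)).item()
    out = fbank.clone()
    out[f0:f0 + f, :] = 0.0                                # frequency-domain mask
    out[:, t0:t0 + t] = 0.0                                # time-domain mask
    return out
```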

4. Experimental Design

4.1. Speaker Identification System Construction Process

The overall framework of the self-supervised speaker identification system based on the pre-trained model is shown in Figure 4, which can be roughly divided into the following steps.
The air-ground communication speaker identification system is divided into a training phase and a testing phase, as shown in Figure 4. In the training phase, the model is pre-trained using approximately 1100 h of real annotated data to obtain a well-performing system model as the seed model. The loss function (Sub-Center Loss) is used to guide the network to update its weights, gradually reducing intra-class distances and increasing inter-class differences. The entire training process continues until the model converges, which means training stops when the loss function reaches its optimal solution.
In the testing phase, audio samples used for enrollment (denoted as L) are first input into the pre-trained neural network to extract their deep representations. For multiple audio samples from the same speaker, the extracted deep representation vectors are averaged to obtain the speaker’s central vector, known as the speaker prototype. These speaker prototypes together form the speaker database in the testing phase. Subsequently, the audio sample to be recognized (Speaker A) is also input into the neural network to extract the corresponding deep representation and map it to the category probabilities of each speaker prototype in the speaker database. Finally, the test sample is classified as the speaker prototype with the highest probability, completing the speaker identification task.
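A sketch of this enrollment-and-scoring procedure is given below. Cosine similarity between the test embedding and each speaker prototype is assumed as the scoring function; `model` stands for any embedding extractor such as the backbone sketched in Section 3.1.

```python
import torch
import torch.nn.functional as F

def identify_speaker(model, enroll_feats_by_spk, test_feat):
    """Average each speaker's L enrollment embeddings into a prototype, then
    assign the test utterance to the highest-scoring prototype."""
    model.eval()
    names, protos = [], []
    with torch.no_grad():
        for spk, feats in enroll_feats_by_spk.items():   # {speaker: [(F, T) Fbank]}
            embs = torch.stack([model(f.unsqueeze(0)).squeeze(0) for f in feats])
            protos.append(F.normalize(embs.mean(dim=0), dim=0))  # speaker prototype
            names.append(spk)
        test_emb = F.normalize(model(test_feat.unsqueeze(0)).squeeze(0), dim=0)
    scores = torch.stack(protos) @ test_emb              # one cosine score per prototype
    return names[int(scores.argmax())]
```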

4.2. Experiment Setup

4.2.1. Dataset

The dataset used for pretraining the model is VoxCeleb2 (Vox2), which is sourced from publicly available video resources, covering various scenarios such as interviews, news, and conversations. It contains around 1100 h of audio data from more than 6000 speakers worldwide, encompassing diverse accents, genders, and background noises. The air-ground communication dataset used in the experiments (shown in Figure 5) comes from a national key research and development program. It includes control recordings from frontline airports in southwestern China and control simulation training recordings from the Civil Aviation Flight University of China. The dataset consists of 5841 short audio segments, totaling 600 min of air-ground communication records. The audio in this dataset was collected during periods of high aircraft traffic, so it contains very few non-speech segments, with most of the audio being valid speech segments. The dataset includes both Chinese and English communication. The audio content comprises air traffic control (ATC) commands, which can be categorized into three key components: call signs, action commands, and action parameters. Call signs are composed of the airline abbreviation and flight number. Action commands refer to the specific directives within ATC communications, such as ascent, descent, or holding. Action parameters offer critical supplementary details for these directives, including speed, altitude, heading, and waypoints, all of which are vital for the accurate execution of the commands. After manual labeling, there are a total of 476 speakers in this dataset, including 25 air traffic controllers (ATCOs) and 451 pilots. After audio segmentation, there are 5438 speech segments, with 2753 belonging to air traffic controllers and 2685 belonging to pilots. The training set (ATC-Communication-train) contains 4078 audio samples from 357 speakers, while the test set (ATC-Communication-test) contains 1360 samples from 119 speakers. The speakers in the training and test sets are mutually exclusive and the speech samples are distinct.

4.2.2. Hyperparameters

The experiment was conducted on a Windows operating system. The computer configuration is as follows: Intel Core i5-8400 processor, 56 GB of RAM, NVIDIA RTX 4090 24 GB graphics card, 250 GB SSD, and a 3.6 TB HDD. The Pytorch framework was used to build the neural network model. The specific model hyperparameter configuration is shown in Table 1:
The experiment compares four models, ResNet34, ECAPA-TDNN, MFA-Conformer, and Chrono-ECAPA-TDNN, to evaluate their performance in the air-ground communication speaker identification task. The experimental setup is as follows: the speech signal has a sampling rate of 16,000 Hz, a frame length of 25 ms, and a frame shift of 10 ms, with 80-dimensional Fbank features extracted. The dataset used in the experiments is stored in WAV file format with a bitrate of 128 kbps. The network channel dimension is 512 and the output representation vector is 192 dimensional. During training, data augmentation includes speed variations of 0.9×, 1.0×, and 1.1×, and the time-domain mask width range for spectral masking is set to (0, 10), while the frequency-domain mask width range is set to (0, 8). In the AAM-Softmax loss function calculation for ResNet34, ECAPA-TDNN, and MFA-Conformer, the additive angular margin (m) is set to 0.2, and the feature scale (s) is set to 30. For Chrono-ECAPA-TDNN’s Sub-center AAM-Softmax loss function, the number of sub-centers (K) is set to 3, the additive angular margin (m) is set to 0.2, and the feature scale (s) is set to 30. The optimizer used is Adam, with an initial learning rate of 0.001 and weight decay set to 1 × 10−5. To prevent overfitting, a 250-step linear warm-up strategy is applied at the beginning of training, followed by a gradual reduction in the learning rate according to a cosine function after the warm-up phase.
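A sketch of the optimizer and learning-rate schedule described above (Adam, initial learning rate 0.001, weight decay 1 × 10−5, 250-step linear warm-up followed by cosine decay) is shown below; the LambdaLR formulation is an implementation choice rather than the authors' exact code.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, warmup_steps=250,
                                 lr=1e-3, weight_decay=1e-5):
    """Adam with a linear warm-up followed by cosine decay of the learning rate."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warm-up
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```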

4.3. Evaluation

4.3.1. Identification Accuracy

Typically, speaker identification performance is evaluated using recognition accuracy (Accuracy, ACC), calculated as follows, as shown in Equation (22):
$\mathrm{Accuracy} = \dfrac{N_{\mathrm{correct}}}{N_{\mathrm{total}}},$  (22)
where $N_{\mathrm{correct}}$ represents the number of samples correctly identified in the test set and $N_{\mathrm{total}}$ represents the total number of samples in the test set.

4.3.2. Equal Error Rate (EER)

The experiment uses Equal Error Rate (EER) as a performance evaluation metric. EER represents the error rate at the threshold where the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal. FAR is the proportion of non-target trials that are incorrectly accepted as the target speaker, relative to all non-target trials. FRR is the proportion of target trials that are incorrectly rejected as non-target, relative to all target trials. The formulas for calculating FAR and FRR are as follows, as shown in Equations (23) and (24):
False Acceptance Rate (FAR):
$\mathrm{FAR} = \dfrac{N_{\mathrm{fa}}}{N_{\mathrm{fa}} + N_{\mathrm{tn}}},$  (23)
False Rejection Rate (FRR):
$\mathrm{FRR} = \dfrac{N_{\mathrm{fr}}}{N_{\mathrm{tp}} + N_{\mathrm{fr}}},$  (24)
where $N_{\mathrm{fa}}$ represents the number of times a non-target speaker is incorrectly identified as the target speaker, $N_{\mathrm{tn}}$ represents the number of times a non-target speaker is correctly identified as a non-target speaker, $N_{\mathrm{fr}}$ refers to the number of times a target speaker is incorrectly identified as a non-target speaker, and $N_{\mathrm{tp}}$ represents the number of times the target speaker is correctly identified as the target speaker.
The value when FAR and FRR are equal is the Equal Error Rate (EER). A smaller EER value indicates better speaker identification performance.
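For reference, EER can be computed from a set of trial scores and target/non-target labels by sweeping the decision threshold and locating the crossing point of FAR and FRR, as in the sketch below (which assumes both target and non-target trials are present).

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep thresholds, track FAR (Eq. 23) over non-target trials and
    FRR (Eq. 24) over target trials, and return the rate where they cross."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)         # True = target trial
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for th in thresholds:
        accept = scores >= th
        fars.append(np.mean(accept[~labels]))        # impostors accepted
        frrs.append(np.mean(~accept[labels]))        # targets rejected
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))             # closest FAR/FRR crossing
    return (fars[idx] + frrs[idx]) / 2.0
```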

4.3.3. Real-Time Factor (RTF)

The experiment uses the real-time factor (RTF) of the voiceprint embedding extraction by the network model as the model's latency indicator. The real-time factor (RTF) [34] is an important metric for evaluating the inference speed of a speech processing system. Its calculation formula is shown in Equation (25):
$\mathrm{RTF} = \dfrac{T_{\mathrm{process}}}{T_{\mathrm{audio}}},$  (25)
where $T_{\mathrm{process}}$ represents the time required to process the audio and $T_{\mathrm{audio}}$ represents the duration of the audio. When $\mathrm{RTF} < 1$, the system is considered to have reached real-time performance [35].
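A minimal timing sketch of Equation (25) is shown below; `model` and `feats` are placeholders for the embedding extractor and its input features.

```python
import time
import torch

def real_time_factor(model, feats, audio_seconds):
    """Wall-clock embedding-extraction time divided by audio duration (Eq. 25);
    RTF < 1 means faster than real time."""
    start = time.perf_counter()
    with torch.no_grad():
        model(feats)                 # extract the speaker embedding
    return (time.perf_counter() - start) / audio_seconds
```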

4.3.4. Standard Deviation (STD)

The experiment uses the standard deviation (STD) [36] to measure the variability of the model’s performance across multiple experiments with different data splits. Standard deviation is an important metric to assess the consistency and stability of the model’s performance. It is calculated using the following Equation (26):
$\mathrm{STD} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2},$  (26)
where $x_i$ represents the result of the $i$-th experiment, $\bar{x}$ is the mean of all experimental results, and $N$ is the total number of experiments.

5. Results

5.1. Model Comparison Experiment

To verify whether the Chrono-ECAPA-TDNN (CET) model constructed in this paper outperforms three mainstream models, ResNet34 [21], ECAPA-TDNN [5], and MFA-Conformer [37], on the air-ground communication dataset, comparative experiments were conducted. The experimental results are shown in Table 2.
The Chrono-ECAPA-TDNN model outperforms the ResNet34, ECAPA-TDNN, and MFA-Conformer models in the speaker identification task. Both ResNet34 and ECAPA-TDNN have EERs above 11%, with relatively close performances, while the EER of Chrono-ECAPA-TDNN is reduced to 9.81%. Compared to ResNet34, Chrono-ECAPA-TDNN reduces the EER by 2.07% and improves the accuracy by 2.41%; compared to ECAPA-TDNN, the EER is reduced by 1.53% and the accuracy increased by 2.19%; compared to MFA-Conformer, Chrono-ECAPA-TDNN’s EER drops by 0.45%, and the accuracy improves by 1.06%. The ResNet34 model solves the gradient vanishing problem through skip connections in residual blocks, but the extracted features are relatively simple. ECAPA-TDNN captures temporal information and key features of audio through the time-delay network and SE module, enhancing feature extraction, but lacks the ability to extract global features. MFA-Conformer, without extensive pretraining on air-ground communication data, struggles to achieve high accuracy when dealing with complex noise and short utterances. In contrast, Chrono-ECAPA-TDNN further optimizes the capture of temporal features and enhances network representation, allowing it to focus on richer feature layers, thus demonstrating superior recognition performance on the complex noise-laden air-ground communication dataset.
To test the generalization ability of the proposed speaker identification model across different datasets, this study selects four open-source datasets as experimental test sets, including the TIMIT dataset, LibriSpeech, VoxCeleb1, and CN-Celeb, labeled as E1, E2, E3, and E4, respectively. By testing the Equal Error Rate (EER) of different models on these datasets, the performance of ResNet34, ECAPA-TDNN, MFA-Conformer, and Chrono-ECAPA-TDNN (CET) models is compared and analyzed in cross-dataset scenarios (as shown in Figure 6). The experimental results are summarized in Table 2, which shows the EER values of each model on datasets E1 to E4, along with the average (AVG) EER value across all datasets.
As shown in Table 2, the Chrono-ECAPA-TDNN (CET) model proposed in this paper outperforms the MFA-Conformer, ECAPA-TDNN, and ResNet34 models across all test datasets from E1 to E4. This indicates that the CET model has strong generalization ability and adaptability to complex speech environments in cross-dataset scenarios.

5.2. Random Cross-Validation Experiment

To comprehensively evaluate the performance of the model and ensure its stability and robustness under different data partitions, this study designs a cross-validation experiment as shown in Table 3. In the experiment, the air-ground communication dataset is randomly divided into 9 groups in a 3:1 ratio (including the initial dataset, making a total of 10 groups) and undergoes multiple rounds of cross-validation to obtain more comprehensive and reliable evaluation results.
In each round of the experiment, the training and testing data are non-overlapping to ensure the independence and reliability of the evaluation. Performance is assessed using metrics such as Equal Error Rate (EER) and accuracy, with results recorded accordingly. To ensure the comprehensiveness of the evaluation, the mean and standard deviation of each round of experiments are calculated, enabling effective assessment of the model’s consistency across different data partitions. In typical machine learning model performance evaluations, it is considered that the model performs stably and robustly when the standard deviation is less than 5%, indicating strong generalization ability and minimal performance fluctuation. This cross-validation design allows for a thorough and in-depth evaluation of the model’s stability and robustness across different data partitions.
Based on the experimental results, the model's performance was evaluated across multiple experiments, with the Equal Error Rate (EER) and accuracy being the primary metrics. The mean accuracy across all experiments, including the CET data, is 88.23%, with a standard deviation of 1.31%, indicating relatively stable performance across different data splits. The mean EER is 9.80%, with a standard deviation of 0.11%, further highlighting the model's consistent performance.

5.3. Ablation Study

To validate the effectiveness of key components in the Chrono-ECAPA-TDNN network model, an ablation study was conducted. The study focused on the improvements made to the parallel branch architecture, including the Bi-LSTM module, multi-head attention module, and the Sub-center AAM-Softmax loss function calculation module. Each of these modifications contributed to performance improvements in the air-ground communication speaker identification system. The experimental results are shown in Table 4.
The first row in Table 4 refers to the training strategy using the SE-Res2Block (right branch) module with a single branch, excluding the parallel branch architecture (baseline model). The second row refers to the model with the multi-head attention module removed, while retaining the Bi-LSTM module and the Sub-Center Loss function. The third row refers to the model with the Bi-LSTM module removed, while keeping the multi-head attention module and Sub-Center Loss function. The fourth row refers to the model where the Sub-Center Loss function is replaced by the AAM Softmax function, while keeping both the multi-head attention module and the Bi-LSTM module. The fifth row represents the evaluation results of the proposed CET model without data augmentation, and the sixth row shows the evaluation results of the CET model with data augmentation.
The ablation study results in Table 4 show that the Chrono-ECAPA-TDNN model significantly improved performance after incorporating the Bi-LSTM module, multi-head attention mechanism in the parallel branch architecture, and Sub-Center Loss function. Compared to the baseline model, the EER decreased from 14.86% to 9.81%, and the accuracy increased from 82.08% to 88.62%. The EER dropped by 5.05%, while accuracy increased by 6.54%. This indicates that the improvements in the Chrono-ECAPA-TDNN model effectively enhance its ability to represent audio features.
The introduction of the Bi-LSTM module improved accuracy by 2.56%, with a slight decrease in EER by 0.53%, effectively capturing both global and local speaker features. The multi-head attention mechanism further increased accuracy by 4.4% and reduced EER by 3.74%. The use of the Sub-Center Loss function further boosted accuracy by 3.52%, with a slight decrease in EER by 0.31%. The ablation study results in Table 4 also confirm the significant enhancement in the performance of the Chrono-ECAPA-TDNN model through data augmentation. When data augmentation was not applied, the model’s accuracy was 87.37%, with an EER of 9.97%. After incorporating data augmentation techniques, accuracy increased to 88.62% and EER further decreased to 9.81%. Compared to the model without data augmentation, accuracy improved by 1.25% and EER decreased by 0.16%.
In summary, the ablation experiment results indicate that optimizing the ECAPA-TDNN model with the Bi-LSTM module, multi-head attention mechanism parallel branch architecture, Sub-Center Loss function, and data augmentation techniques can effectively improve the model’s classification performance. These optimization strategies complement each other, enhancing the model’s classification capabilities and further optimizing its application effectiveness.

6. Discussion

The proposed Chrono-ECAPA-TDNN (CET) model has demonstrated significant improvements in speaker identification performance, particularly in the challenging environment of air-ground communication. The experimental results show that CET effectively addresses the issues of complex noise and short speech, two common challenges in this domain. The incorporation of Bi-LSTM modules and multi-head attention mechanisms in the Chrono Block module has proven to be highly beneficial in capturing both global and local speaker features, as evidenced by the model’s reduced EER and increased accuracy in comparison to baseline models such as ResNet34, ECAPA-TDNN, and MFA-Conformer.
One of the most notable contributions of this work is the Sub-center AAM-Softmax loss function, which significantly enhanced the model’s classification ability by improving intra-class feature compactness and inter-class separation. This novel loss function, when combined with the data augmentation strategies, helped the model perform well even in noisy environments with short speech segments. In particular, the data augmentation techniques—such as speed perturbation, spectral masking, and noise suppression—played a crucial role in boosting the model’s robustness, making it more adaptable to real-world scenarios where speech quality can vary greatly.
The results from the cross-domain testing further validate the generalization capability of CET. By achieving superior performance across multiple datasets, including VoxCeleb2, TIMIT, LibriSpeech, VoxCeleb1, and CN-Celeb, CET has shown that it is not only effective for air-ground communication but can also be applied to a broad range of speaker identification tasks in various domains.

7. Conclusions

In this paper, a novel, efficient, and robust speaker identification network model for air-ground communication scenarios, called Chrono-ECAPA-TDNN (CET), is proposed. This model integrates a parallel branch architecture, Bi-LSTM module, multi-head attention mechanism, and Sub-center AAM-Softmax loss function, showing excellent performance in handling complex noise and short utterance issues. The CET model achieves a real-time factor (RTF) of 0.0202 for speaker embedding extraction, meeting real-time recognition standards (RTF < 1). In the model performance comparison experiments, the CET model achieves an EER of 9.81% and an accuracy of 88.62% on the air-ground communication dataset, significantly outperforming ResNet34, ECAPA-TDNN, and MFA-Conformer. The stability of the model was evaluated through random cross-validation experiments. The mean accuracy across all experiments, including CET data, is 88.23%, with a standard deviation of 1.31%, and the mean EER is 9.80%, with a standard deviation of 0.11%. Both standard deviations are below 5%, indicating that the model exhibits relatively stable performance across different data splits. The ablation experiment results show that the Bi-LSTM module, multi-head attention mechanism, and Sub-Center Loss function play key roles in improving the model's performance, while data augmentation significantly enhances the model's generalization ability in complex scenarios. The improvements made to the CET model, particularly the effective fusion of local and global features, offer a new approach for speaker identification tasks in short utterances and high-noise environments in air-ground communication. In future work, we plan to apply the Chrono-ECAPA-TDNN model to real-time air-ground communication speaker identification systems and further optimize the model's deployment on edge computing devices, conducting more speaker identification research in practical application scenarios.

Author Contributions

Conceptualization, W.P. and S.C. (Shenhao Chen); methodology, S.C. (Shenhao Chen); software, S.C. (Shenhao Chen); validation, S.C. (Shenhao Chen) and W.P.; formal analysis, S.C. (Shenhao Chen) and Y.W.; investigation, S.C. (Shenhao Chen); resources, W.P.; data curation, S.C. (Sheng Chen); writing—original draft preparation, S.C. (Shenhao Chen); writing—review and editing, W.P., S.C. (Shenhao Chen) and X.W.; visualization, S.C. (Shenhao Chen); supervision, S.C. (Shenhao Chen) and Y.W.; project administration, S.C. (Shenhao Chen); funding acquisition, W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U2333209), National Key R&D Program of China (No. 2021YFF0603904), National Natural Science Foundation of China (U2333207), Sichuan Provincial Civil Aviation Flight Technology and Flight Safety Engineering Technology Research Center (GY2024-45E), and the Fundamental Research Funds for the Central Universities Grant Number (24CAFUC03046).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Togneri, R.; Pullella, D. An overview of speaker identification: Accuracy and robustness issues. IEEE Circuits Syst. Mag. 2011, 11, 23–61. [Google Scholar] [CrossRef]
  2. Zhang, R.; Yan, Z.; Wang, X.; Deng, R.H. Livoauth: Liveness detection in voiceprint authentication with random challenges and detection modes. IEEE Trans. Ind. Inform. 2022, 19, 7676–7688. [Google Scholar] [CrossRef]
  3. Hanifa, R.M.; Isa, K.; Mohamad, S. A review on speaker recognition: Technology and challenges. Comput. Electr. Eng. 2021, 90, 107005. [Google Scholar] [CrossRef]
  4. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  5. Desplanques, B.; Thienpondt, J.; Demuynck, K. Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. In Proceedings of the INTERSPEECH 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
  6. Pan, W.; Wang, Y.; Zhang, Y.; Han, B. ATC-SD Net: Radiotelephone Communications Speaker Diarization Network. Aerospace 2024, 11, 599. [Google Scholar] [CrossRef]
  7. Chung, J.S.; Nagrani, A.; Zisserman, A. Voxceleb2: Deep speaker recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  8. Lopes, C.; Perdigao, F. Phone recognition on the TIMIT database. In Speech Technologies; IntechOpen: London, UK, 2011; Volume 1, pp. 285–302. [Google Scholar]
  9. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
  10. Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
  11. Fan, Y.; Kang, J.; Li, L.; Li, K.; Chen, H.; Cheng, S.; Zhang, P.; Zhou, Z.; Cai, Y.; Wang, D. Cn-celeb: A challenging chinese speaker recognition dataset. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7604–7608. [Google Scholar]
  12. Chen, J.-Y.; Jeng, J.-T. Text-Independent Speaker Verification Using Lightweight 3D Convolutional Neural Networks. In Proceedings of the 2024 International Conference on System Science and Engineering (ICSSE), Hsinchu, Taiwan, 26–28 June 2024; pp. 1–5. [Google Scholar]
  13. Abbood, Z.A.; Yasen, B.T.; Ahmed, M.R.; Duru, A.D. Speaker identification model based on deep neural networks. Iraqi J. Comput. Sci. Math. 2022, 3, 108–114. [Google Scholar]
  14. Zhu, Y.; Mak, B. Bayesian self-attentive speaker embeddings for text-independent speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1000–1012. [Google Scholar] [CrossRef]
  15. Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep neural network embeddings for text-independent speaker verification. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003. [Google Scholar]
  16. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar]
  17. Zhang, C.; Koishida, K. End-to-end text-independent speaker verification with triplet loss on short utterances. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1487–1491. [Google Scholar]
  18. Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4052–4056. [Google Scholar]
  19. Zhu, Y.; Ko, T.; Snyder, D.; Mak, B.; Povey, D. Self-attentive speaker embeddings for text-independent speaker verification. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2–6. [Google Scholar]
  20. Vydana, H.K.; Vuppala, A.K. Residual neural networks for speech recognition. In Proceedings of the 2017 25th European Signal Processing Conference (Eusipco), Kos Island, Greece, 28 August–2 September 2017; pp. 543–547. [Google Scholar]
  21. Chung, J.S.; Huh, J.; Mun, S.; Lee, M.; Heo, H.S.; Choe, S.; Ham, C.; Jung, S.; Lee, B.-J.; Han, I. In defence of metric learning for speaker recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
  22. Liu, B.; Qian, Y. ECAPA++: Fine-grained deep embedding learning for TDNN based speaker verification. In Proceedings of the International Speech Communication Association (INTERSPEECH 2023), Dublin, Ireland, 20–24 August 2023; pp. 3132–3136. [Google Scholar]
  23. Zhang, E.; Wu, Y.; Tang, Z. SC-EcapaTdnn: ECAPA-TDNN with Separable Convolutional for Speaker Recognition. In Proceedings of the International Conference on Intelligence Science, Nanjing, China, 25–28 October 2024; pp. 286–297. [Google Scholar]
  24. Zouhir, Y.; Zarka, M.; Ouni, K. Bionic Cepstral coefficients (BCC): A new auditory feature extraction to noise-robust speaker identification. Appl. Acoust. 2024, 221, 110026. [Google Scholar] [CrossRef]
  25. Chauhan, N.; Isshiki, T.; Li, D. Enhancing Speaker Recognition Models with Noise-Resilient Feature Optimization Strategies. In Proceedings of the Acoustics, Broadbeach, Australia, 6–8 November 2024; pp. 439–469. [Google Scholar]
  26. Salazar, A.; Vergara, L.; Vidal, E. A proxy learning curve for the Bayes classifier. Pattern Recognit. 2023, 136, 109240. [Google Scholar] [CrossRef]
  27. Chen, Y.-W.; Hung, K.-H.; Li, Y.-J.; Kang, A.C.-F.; Lai, Y.-H.; Liu, K.-C.; Fu, S.-W.; Wang, S.-S.; Tsao, Y. CITISEN: A deep learning-based speech signal-processing mobile application. IEEE Access 2022, 10, 46082–46099. [Google Scholar] [CrossRef]
  28. Salazar, A.; Vergara, L.; Safont, G. Generative Adversarial Networks and Markov Random Fields for oversampling very small training sets. Expert Syst. Appl. 2021, 163, 113819. [Google Scholar] [CrossRef]
  29. Nisa, R.; Baba, A.M. A speaker identification-verification approach for noise-corrupted and improved speech using fusion features and a convolutional neural network. Int. J. Inf. Technol. 2024, 16, 3493–3501. [Google Scholar] [CrossRef]
  30. Zhao, Z.; Li, Z.; Wang, W.; Zhang, P. Pcf: Ecapa-tdnn with progressive channel fusion for speaker verification. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  31. Yao, J.; Liang, C.; Peng, Z.; Zhang, B.; Zhang, X.-L. Branch-ECAPA-TDNN: A parallel branch architecture to capture local and global features for speaker verification. In Proceedings of the International Speech Communication Association (INTERSPEECH 2023), Dublin, Ireland, 20–24 August 2023; pp. 1943–1947. [Google Scholar]
  32. Wang, C.; Xu, L.; Zhu, H.; Cheng, X. Robustness study of speaker recognition based on ECAPA-TDNN-CIFG. J. Comput. Methods Sci. Eng. 2024, 24, 3287–3296. [Google Scholar] [CrossRef]
  33. Deng, J.; Guo, J.; Liu, T.; Gong, M.; Zafeiriou, S. Sub-center arcface: Boosting face recognition by large-scale noisy web faces. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XI; pp. 741–757. [Google Scholar]
  34. Higuchi, Y.; Inaguma, H.; Watanabe, S.; Ogawa, T.; Kobayashi, T. Improved Mask-CTC for non-autoregressive end-to-end ASR. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 8363–8367. [Google Scholar]
  35. Sato, H.; Moriya, T.; Mimura, M.; Horiguchi, S.; Ochiai, T.; Ashihara, T.; Ando, A.; Shinayama, K.; Delcroix, M. Speakerbeam-ss: Real-time target speaker extraction with lightweight conv-tasnet and state space modeling. In Proceedings of the Interspeech 2024, Kos, Greece, 1–5 September 2024. [Google Scholar]
  36. Pacheco, A.G.; Krohling, R.A. Ranking of classification algorithms in terms of mean–standard deviation using A-TOPSIS. Ann. Data Sci. 2018, 5, 93–110. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Lv, Z.; Wu, H.; Zhang, S.; Hu, P.; Wu, Z.; Lee, H.-y.; Meng, H. Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar]
Figure 1. Chrono-ECAPA-TDNN network architecture.
Figure 2. Architecture of the Chrono module.
Figure 3. Diagram of the sub-center AAM-Softmax loss function calculation.
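For readers unfamiliar with the loss illustrated in Figure 3, the PyTorch sketch below shows one minimal way a Sub-center AAM-Softmax layer could be written with the hyperparameters from Table 1 (m = 0.2, s = 30, K = 3). The class name, weight layout, and initialisation are our own illustrative assumptions and do not come from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubcenterAAMSoftmax(nn.Module):
    """Minimal sketch of a Sub-center AAM-Softmax loss (illustrative only)."""

    def __init__(self, embedding_dim=192, num_classes=1000, k=3, margin=0.2, scale=30.0):
        super().__init__()
        self.k, self.margin, self.scale = k, margin, scale
        # One weight vector per (class, sub-center) pair.
        self.weight = nn.Parameter(torch.empty(num_classes * k, embedding_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and all sub-center weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))  # [B, C*K]
        # Keep only the closest sub-center of each class.
        cos = cos.view(cos.size(0), -1, self.k).max(dim=2).values          # [B, C]
        theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Apply the additive angular margin m only to the target-class angle.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
        logits = self.scale * torch.cos(theta + self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```

In training, `embeddings` would be the 192-dimensional speaker embeddings produced by the embedding extraction module, and `labels` the integer speaker identities.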
Figure 4. Flowchart of air-ground communication speaker identification.
Figure 5. Composition of the air-ground communication dataset.
Figure 6. EER test results for models on different datasets.
Table 1. Hyperparameter settings for Chrono Block model training.

Input features: Fbank, 80 dimensions
Number of attention heads: 8
Chrono Blocks: 3
Chrono Block channel dimension: 512
Chrono Block output dimension: 192
Training epochs: 80
Batch size: 128
Optimizer: Adam
Scheduler: WarmupLR
Additive angular margin m: 0.2
Scaling factor s: 30
Number of sub-centers K: 3
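As a convenience for reproduction, the settings in Table 1 can be gathered into a single configuration object. The sketch below is only illustrative; the class and field names are our own assumptions rather than the authors' code.

```python
from dataclasses import dataclass

@dataclass
class CETTrainingConfig:
    """Hypothetical container for the Table 1 hyperparameters (names are illustrative)."""
    fbank_dim: int = 80          # 80-dimensional Fbank input features
    attention_heads: int = 8     # multi-head attention heads
    num_chrono_blocks: int = 3   # number of Chrono Blocks
    channel_dim: int = 512       # Chrono Block channel dimension
    embedding_dim: int = 192     # Chrono Block output / speaker embedding dimension
    epochs: int = 80
    batch_size: int = 128
    optimizer: str = "Adam"
    scheduler: str = "WarmupLR"
    aam_margin: float = 0.2      # additive angular margin m
    aam_scale: float = 30.0      # scaling factor s
    num_subcenters: int = 3      # number of sub-centers K

config = CETTrainingConfig()
print(config.embedding_dim)  # 192
```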
Table 2. Model comparison experimental results. ACC (%), EER (%), and RTF are measured on the air-ground communication dataset; E1–E4 and AVG report EER (%) on the four cross-domain test sets.

Model               Parameters/10^6   ACC (%)   EER (%)   RTF      E1     E2     E3     E4     AVG
ResNet34 [21]       5.36              86.21     11.88     0.0153   6.81   7.03   7.48   8.25   7.39
ECAPA-TDNN [5]      6.19              86.43     11.34     0.0172   6.02   6.24   6.66   7.24   6.54
MFA-Conformer [37]  8.92              87.56     10.26     0.0183   5.82   6.04   6.38   7.07   6.33
CET                 10.24             88.62     9.81      0.0202   5.37   5.61   5.94   6.33   5.81
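The EER values in Table 2 correspond to the operating point at which the false-acceptance and false-rejection rates coincide. The snippet below shows a generic way such a value is commonly estimated from verification scores; it is not the authors' evaluation code and assumes scikit-learn is available.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """Estimate the EER from raw verification scores and 0/1 ground-truth labels."""
    fpr, tpr, _ = roc_curve(labels, scores)  # false-positive and true-positive rates
    fnr = 1.0 - tpr                          # false-negative (miss) rate
    idx = np.nanargmin(np.abs(fnr - fpr))    # threshold where FAR and FRR are closest
    return 0.5 * (fpr[idx] + fnr[idx])

# Toy example: 1 = same-speaker trial, 0 = different-speaker trial.
print(equal_error_rate(np.array([0.9, 0.8, 0.3, 0.2]), np.array([1, 1, 0, 0])))
```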
Table 3. Random cross-validation experiment results.

Experiment   ACC (%)   EER (%)
0            87.25     9.75
1            89.15     9.65
2            88.03     9.89
3            90.12     9.72
4            86.91     9.80
5            88.50     9.60
6            89.85     9.95
7            87.75     9.85
8            90.20     9.92
CET          88.62     9.81
AVG          88.23     9.80
STD          1.31      0.11
Table 4. Ablation experiment results. The ablated components are Bi-LSTM, multi-headed self-attention, Sub-center loss, AAM-Softmax loss, and data enhancement; each × marks a disabled component.

Experiment   Disabled Components   ACC (%)   EER (%)
0            ××××                  82.08     14.86
1            ×××                   84.22     13.55
2            ×××                   86.06     10.34
3            ××                    85.10     10.12
4            ××                    87.37     9.97
5            ×                     88.62     9.81
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
