Communication

Global–Local Self-Attention Based Transformer for Speaker Verification

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
2 Zhongyuan Network Security Research Institute, Zhengzhou University, Zhengzhou 450002, China
3 Zhongyuan Cyber Security Research Institute Research Base of Songshan Laboratory, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 10154; https://doi.org/10.3390/app121910154
Submission received: 26 August 2022 / Revised: 25 September 2022 / Accepted: 4 October 2022 / Published: 10 October 2022
(This article belongs to the Special Issue Deep Convolutional Neural Networks)

Abstract

Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work identified an efficient way to model speaker embeddings with the Transformer by combining it with convolutional networks. However, the traditional global self-attention mechanism lacks the ability to capture local information. To alleviate this problem, we proposed a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method splits the attention heads into two groups that perform local and global attention in parallel, which enhances local modeling and reduces computational cost. To better handle local positional information, we introduced locally enhanced positional encoding into the speaker verification task. Experimental results on the VoxCeleb1 test set, with models trained on the VoxCeleb2 dev set, demonstrated the effectiveness of the proposed global–local self-attention mechanism. Compared with the Transformer-based Robust Embedding Extractor Baseline System, the proposed speaker Transformer network exhibited better performance on the speaker verification task.

1. Introduction

Speaker verification determines whether the speaker of a test utterance is the same as the speaker of a reference utterance, based on the speaker’s enrollment utterances. Research in this area has mainly focused on obtaining a fixed-dimensional vector representing an utterance, known as a speaker embedding. These speaker embeddings are then scored to verify the speaker’s identity. Research on speaker embedding extractors aims to enhance inter-speaker variability and suppress intra-speaker variability. In general, extracting speaker embeddings is a crucial factor that largely determines the performance of speaker verification systems.
In recent years, the Transformer model [1] has demonstrated excellent performance in natural language processing (NLP), and interest in applying it to speech processing has grown rapidly. Inspired by its success in NLP, some studies [2,3,4] have tried to apply the Transformer to speaker recognition by replacing segment-level pooling layers or frame-level convolutional layers. However, these self-attention-based works are still dominated by previous architectures, such as Residual Networks (ResNets) [5,6,7], Time-Delay Neural Networks (TDNNs) [8,9], and Long Short-Term Memory (LSTM) networks [10]. Other works, such as s-vectors [11], have used the Transformer as the backbone but still did not alleviate its limitations in speaker verification.
Since the input is a speech signal, the speaker verification task differs from NLP. There are two challenges in applying the Transformer to speaker verification: (1) Transformers are difficult to scale efficiently because acoustic feature sequences are much longer than text sentences [12]; (2) compared with CNNs, the Transformer is weaker at capturing local information. Given the success of networks such as TDNNs that focus on local features, it is generally believed that local features improve network performance, so we expect that enhancing the Transformer’s ability to capture local information will improve its performance in speaker recognition.
In this work, we propose global–local self-attention to enable the Transformer to model local features while maintaining its ability to model long-distance dependencies. We divide the attention heads into different groups and perform local or global self-attention in each group. This head-splitting strategy introduces no additional computational cost while enhancing both global dependency modeling and local information modeling. In the encoder and decoder, we use additional skip connections to aggregate features at different levels. Furthermore, we introduce locally enhanced positional encoding to further enhance the locality of the model. Without adding extra computation, the combination of multi-level feature aggregation and enhanced local modeling improves the performance of the Transformer on speaker verification tasks.
The paper is organized as follows: Section 2 reviews the previous work related to the self-attention mechanism in speaker recognition. Section 3 presents and explains our proposed model. In Section 4, we discuss the experimental details and analyze the results. Section 5 concludes this paper.

2. Related Work

Convolutional neural networks have long dominated the field of speaker recognition with great success. Recently, owing to the excellent performance of the Transformer in NLP and speech recognition, several works have studied how to apply the Transformer to speaker recognition.
The attention mechanism is at the heart of the Transformer’s excellent performance. One line of work applies attention to the pooling layer of speaker recognition systems as an alternative way of aggregating temporal information. Okabe et al. [9] proposed attentive statistics pooling, which weights each frame by its importance. The attention mechanism was combined with a TDNN-based embedding extractor to assign different weights to different frames and generate weighted means and standard deviations. Cai et al. [5] and Zhu et al. [13] proposed pooling layers incorporating a self-attention mechanism to obtain utterance-level representations. Wu et al. [14] improved this by adopting vectorial attention instead of scalar attention. India et al. [2] presented double multi-head attention pooling, which extended the previously proposed self-multi-head-attention-based method: an additional self-attention layer, which enhanced the pooling mechanism by assigning weights to the information captured by each head, was added to the pooling layer. Wang et al. [15] proposed multi-resolution multi-head attention pooling, which fuses the attention weights of different resolutions to improve the diversity of attention heads. Instead of utilizing multi-head attention in parallel, Zhu et al. [3] proposed serialized multi-layer multi-head attention, which aggregates and propagates attention statistics from one layer to the next in a serialized manner.
Different from the above studies, some works have focused on channel-wise attention. Yu et al. [16] proposed a dynamic channel-wise selection mechanism based on softmax attention, integrating information from multiple network branches with a channel-wise selection mechanism. Jiang et al. [17] introduced a gating mechanism that provides channel-wise attention by exploiting inter-dependencies across channels. These works extended the attention mechanism to the channel dimension to select more important channel information, but this led to only limited improvements in speaker recognition performance.
In recent years, some works have directly stacked attention layers as part of, or as the whole, embedding extractor. Shi et al. [4,18] applied attention layers and stacked Transformer encoders on frame-level and segment-level encoders, respectively, to capture speaker information both locally and globally. The study by Shi et al. [4] improved on Shi et al. [18] by using Transformer encoders with memory to replace the attention layer, and it proposed the idea of using Transformer blocks to process acoustic features split into segments; however, it did not integrate the window-splitting operation into the Transformer module. Desplanques et al. [19] further incorporated channel attention with a global context into the frame-level layers and the statistics pooling layer for better performance. Such works, like [8], are still dominated by sophisticated convolutional networks. Conversely, Safari et al. [20] proposed a serialized multi-layer multi-head attention. This work consisted of three main stages, namely a frame-level feature processor, a serialized attention mechanism, and a speaker classifier. The frame-level feature processor used a TDNN to extract high-level representations of the input acoustic features. The serialized attention mechanism consisted of a concatenated self-attention encoding structure that stacked Transformer encoder blocks followed by an additional attention pooling; this structure aggregated variable-length feature sequences into a fixed-dimensional representation to create discriminative speaker embeddings. Metilda et al. [11] proposed s-vectors, which replaced the TDNN of [8] with stacked Transformer encoder modules followed by a statistics pooling layer and two linear layers. To better capture speaker characteristics, that work used self-attention as the backbone of its architecture; its advantage is that it is not restricted to a fixed context and attends to all frames at each time step. These works show that Transformer models have the potential to be applied to speaker verification. However, Transformer-based embedding extractors suffer from inferior performance in speaker recognition due to their limited capacity to capture local features. Wang et al. [12] proposed a multi-view attention mechanism that captures long-distance dependencies and models locality by controlling the self-attention receptive field of each head through a head-wise masking matrix. This work made some progress on the problem; it used a mask to realize local self-attention, but masking the computed results wastes computing resources.

3. Proposed Architecture

Using the original self-attention alone may not be sufficient to capture local contextual features of utterances. To better capture speaker features, we proposed global–local self-attention in our architecture. In this section, we introduce the structure of each module and explain how these designs were incorporated into our proposed model. The following subsections focus on the different submodules. Figure 1 presents the complete architecture of the model. BN stands for Batch Normalization [21].

3.1. Overall Architecture

The overall architecture of our proposed method is shown in Figure 1. The input is an 80-dimensional mel-filter bank feature sequence, which is passed through a one-dimensional convolutional layer (kernel size 3, stride 1) to obtain a C × T output, where C and T denote the number of channels and the number of time frames, respectively. The convolution uses overlapping windows to form coarse features, which lay the foundation for extracting speaker-discriminative embeddings. We used an architecture with encoders and decoders as the embedding extractor. After the decoder, we employed an x-vector-like head consisting of attentive pooling [9] followed by a fully connected layer to generate the final speaker embedding. The whole system contains 25.2 million parameters in total.
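To make the data flow concrete, the following PyTorch sketch outlines the pipeline described above (convolutional front end, encoder, decoder, attentive pooling, fully connected embedding layer). It is a simplified illustration rather than the exact implementation: the class names, the 192-dimensional embedding, the 128-unit hidden layer inside the pooling attention, and the padding of the front-end convolution are assumptions not stated in the text, and Encoder refers to the module sketched in Section 3.3.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling [9]: weighted mean and std over time."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, 128, kernel_size=1), nn.Tanh(),
            nn.Conv1d(128, channels, kernel_size=1), nn.Softmax(dim=2))

    def forward(self, x):                      # x: (B, C, T)
        w = self.attn(x)                       # frame-level attention weights
        mu = (x * w).sum(dim=2)
        sigma = ((x ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-6).sqrt()
        return torch.cat([mu, sigma], dim=1)   # (B, 2C)

class SpeakerTransformer(nn.Module):
    """Conv front end -> encoder -> decoder -> attentive pooling -> embedding."""
    def __init__(self, n_mels=80, channels=512, emb_dim=192,
                 encoder_blocks=4, decoder_blocks=3):
        super().__init__()
        self.front_end = nn.Conv1d(n_mels, channels, kernel_size=3,
                                   stride=1, padding=1)
        self.encoder = Encoder(channels, encoder_blocks)   # Section 3.3 sketch
        self.decoder = Encoder(channels, decoder_blocks)   # same structure, fewer blocks
        self.pool = AttentiveStatsPool(channels)
        self.fc = nn.Linear(2 * channels, emb_dim)

    def forward(self, feats):                  # feats: (B, 80, T) mel-filter banks
        x = self.front_end(feats)              # (B, C, T) coarse features
        x = self.encoder(x)
        x = self.decoder(x)
        return self.fc(self.pool(x))           # final speaker embedding
```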

3.2. Transformer Block

The overall topology of the Transformer block is illustrated in Figure 2a. It differs from the original Transformer module [1] in two ways: we replaced the multi-head self-attention mechanism with our proposed global–local self-attention mechanism, and, to introduce a local inductive bias, locally enhanced positional encoding [22] was added as a module parallel to the proposed self-attention mechanism. The Transformer block maintains the size of the feature maps and uses an MLP ratio of 3.2 and 8 attention heads. The Transformer block is formally defined as:
\hat{X}^{l} = \mathrm{GLAttention}\left(\mathrm{LayerNorm}\left(X^{l-1}\right)\right) + X^{l-1}
X^{l} = \mathrm{MLP}\left(\mathrm{LayerNorm}\left(\hat{X}^{l}\right)\right) + \hat{X}^{l}
where X^l denotes the output of the lth Transformer block of the encoder or decoder; for the first block of each module, X^{l-1} is the output of the previous module.
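A minimal PyTorch sketch of this pre-norm block, directly following the two equations above; the GELU activation inside the MLP is an assumption (the activation function is not specified here), and GlobalLocalAttention refers to the module sketched in Section 3.2.1.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block: X_hat = GLAttention(LN(X)) + X; X = MLP(LN(X_hat)) + X_hat."""
    def __init__(self, channels=512, heads=8, mlp_ratio=3.2, window=25):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = GlobalLocalAttention(channels, heads, window)  # Section 3.2.1
        self.norm2 = nn.LayerNorm(channels)
        hidden = int(channels * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.GELU(),
                                 nn.Linear(hidden, channels))

    def forward(self, x):                      # x: (B, T, C), shape is preserved
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```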

3.2.1. Global–Local Self-Attention

Despite its strong ability to model global dependencies, the original full self-attention mechanism struggles to capture local information in utterances, which are much longer than text. In a recent study [12], local self-attention with a sliding window was applied to speaker recognition and achieved competitive performance. Inspired by [23], we proposed a novel global–local self-attention mechanism that improves the capability to capture local features while retaining the capability to model long-distance dependencies. As shown in Figure 2b, for half of the channels of the feature maps, self-attention is implemented as local self-attention, with every attention head using a sliding window of the same size, while for the other half of the channels it is implemented as global self-attention without a sliding window. As in the original full self-attention mechanism, the input features X ∈ ℝ^{C×T} are linearly transformed into K attention heads, and each attention head then performs local or global self-attention. For the local heads, we use a non-overlapping sliding window of size w to partition X into segments X^1, …, X^N, which gives the model better local learning ability. Assuming that the query, key, and value matrices of the kth attention head all have dimension d_k, the proposed local self-attention for the kth head is defined as:
X = \left[X^{1}, X^{2}, \ldots, X^{N}\right], \quad N = \frac{T}{w}
Y_{i}^{k} = \mathrm{Attention}\left(X^{i} W_{k}^{Q}, X^{i} W_{k}^{K}, X^{i} W_{k}^{V}\right)
\mathrm{LocalAttention}_{k}(X) = \left[Y_{1}^{k}, Y_{2}^{k}, \ldots, Y_{N}^{k}\right]
where W_k^Q, W_k^K, W_k^V ∈ ℝ^{C×d_k} are the linear projection parameter matrices of the queries, keys, and values for the kth attention head, respectively, and d_k is set to C/K. We divided the K attention heads equally into two groups; K is usually an even number, so the heads can be split evenly. The first group of attention heads performs local self-attention, while the second group performs global self-attention. Global self-attention differs from standard multi-head self-attention only in that locally enhanced positional encoding is added, and the output of the kth global head is denoted as GlobalAttention_k(X):
\mathrm{head}_{k} = \begin{cases} \mathrm{LocalAttention}_{k}(X), & k = 1, \ldots, K/2 \\ \mathrm{GlobalAttention}_{k}(X), & k = K/2 + 1, \ldots, K \end{cases}
Finally, the results of these two kinds of attention are concatenated together as the input of the MLP and denoted as GLAttention(X):
\mathrm{GLAttention}(X) = \mathrm{cat}\left(\mathrm{head}_{1}, \ldots, \mathrm{head}_{K}\right) W
where W ∈ ℝ^{C×C} is the projection matrix that maps the self-attention results to the target output dimension. The key design is to split the attention heads into two groups and perform local and global self-attention in parallel. This allows local attention to operate under the guidance of global attention, so that global and local information can interact more effectively. The window size w is chosen empirically; the comparison of different window sizes is reported in Section 4.
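The head-splitting computation can be sketched in PyTorch as follows. This is an illustrative implementation under stated assumptions (input shape (B, T, C) with T divisible by w, an even number of heads); the locally enhanced positional encoding of Section 3.2.2 is omitted here to keep the grouping logic clear.

```python
import math
import torch
import torch.nn as nn

class GlobalLocalAttention(nn.Module):
    """Half of the heads attend within non-overlapping windows of size w (local);
    the other half attend over the full sequence (global); outputs are concatenated."""
    def __init__(self, channels=512, heads=8, window=25):
        super().__init__()
        assert channels % heads == 0 and heads % 2 == 0
        self.h, self.dk, self.w = heads, channels // heads, window
        self.qkv = nn.Linear(channels, 3 * channels)
        self.proj = nn.Linear(channels, channels)

    def _attend(self, q, k, v):
        # q, k, v: (..., T, d_k); scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dk)
        return scores.softmax(dim=-1) @ v

    def forward(self, x):                                   # x: (B, T, C), T divisible by w
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                                        # -> (B, heads, T, d_k)
            return t.view(B, T, self.h, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        half = self.h // 2
        # local heads: partition the time axis into N = T // w windows
        N = T // self.w
        ql = q[:, :half].reshape(B, half, N, self.w, self.dk)
        kl = k[:, :half].reshape(B, half, N, self.w, self.dk)
        vl = v[:, :half].reshape(B, half, N, self.w, self.dk)
        local = self._attend(ql, kl, vl).reshape(B, half, T, self.dk)
        # global heads: attend over the whole sequence
        glob = self._attend(q[:, half:], k[:, half:], v[:, half:])

        out = torch.cat([local, glob], dim=1)               # (B, heads, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```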

3.2.2. Locally Enhanced Positional Encoding

The positional encoding mechanism plays a pivotal role in the Transformer model. Since the self-attention operation is permutation-invariant, it ignores location information within the input features. To add this information, we considered a straightforward way to add position information to the linear projection values. In addition, we wanted the input element to pay more attention to the location information of its local neighborhood. Therefore, we adopted the locally enhanced positional encoding (LePE) method. LePE is generated by applying a depth-wise convolutional layer [24] on the value V. Given the matrices Q, K, and V in the Transformer model, after adding LePE, the proposed self-attention mechanism can be formulated as:
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V + \mathrm{DWConv}(V)
In this way, LePE conveniently adds local contextual location information to input elements.
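A minimal sketch of this formulation for a single head: the depth-wise convolution runs along the time axis of V, and its output is added to the attention result. The kernel size of 3 is an assumption; the text does not state the kernel size used.

```python
import math
import torch.nn as nn

class LePEAttention(nn.Module):
    """Attention with locally enhanced positional encoding:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V + DWConv(V)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dim = dim
        # depth-wise conv over time: one filter per channel (groups=dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, q, k, v):                       # q, k, v: (B, T, dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dim)
        attn_out = scores.softmax(dim=-1) @ v
        lepe = self.dwconv(v.transpose(1, 2)).transpose(1, 2)   # DWConv(V) over time
        return attn_out + lepe
```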

3.3. Encoder

In our work, the encoder consisted of N_i sequential Transformer blocks. Deeper features are generally considered to be more complex and can effectively represent the speaker’s identity. However, evidence in [19] suggests that shallow feature maps in hierarchical networks also contribute to more robust speaker embeddings, and we argue that this also holds for our proposed Transformer model. After the sequence of Transformer blocks, we concatenated the outputs of each Transformer block via skip connections to generate new feature maps. A fully connected layer (called the sub-block aggregation net) then processed the aggregated features and adjusted the feature dimension to match the input of the next module, producing the input of the decoder. The decoder has the same architecture as the encoder, including several sequential Transformer blocks and a sub-block aggregation net; the difference is that the encoder is generally deeper than the decoder. For both the encoder and the decoder, we applied layer normalization [25] to the aggregated information before the sub-block aggregation net. In this work, we used a 4-layer encoder and a 3-layer decoder.
In our proposed architecture, we used the sub-block aggregation net in the encoder and decoder to aggregate features at different levels, which can prevent the model from consuming too much memory. The encoder extracted the fixed-length representation from the coarse speech features output by the convolutional layer as intermediate features passed to the decoder to obtain utterance-level speaker embeddings. Finally, the output of the decoder was used by the pooling layer to generate the final speaker embedding.
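The following sketch shows how the skip connections and the sub-block aggregation net could be wired for the encoder (the decoder reuses the same structure with fewer blocks). It is an illustration consistent with the description above, not the exact implementation; TransformerBlock is the module from Section 3.2.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stack of Transformer blocks whose per-block outputs are concatenated
    (skip connections), layer-normalized, and projected back to `channels`
    by a fully connected sub-block aggregation net."""
    def __init__(self, channels=512, num_blocks=4, heads=8, window=25):
        super().__init__()
        self.blocks = nn.ModuleList(
            TransformerBlock(channels, heads=heads, window=window)
            for _ in range(num_blocks))
        self.norm = nn.LayerNorm(num_blocks * channels)
        self.aggregate = nn.Linear(num_blocks * channels, channels)

    def forward(self, x):                     # x: (B, C, T) from the previous module
        x = x.transpose(1, 2)                 # Transformer blocks expect (B, T, C)
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)                    # keep every block's output
        agg = self.aggregate(self.norm(torch.cat(outs, dim=-1)))
        return agg.transpose(1, 2)            # back to (B, C, T)
```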

4. Experiments

4.1. Data and Features

For this work, we used the VoxCeleb dataset for training and testing. The dataset has two versions, VoxCeleb1 [26] and VoxCeleb2 [27]. VoxCeleb1 contains over 100,000 utterances from 1251 celebrities, while VoxCeleb2 contains over 1 million utterances from 6112 identities. There is no overlap between the two versions. All data preparation steps were performed using the SpeechBrain VoxCeleb recipe [28]. All our systems were trained with SpeechBrain and evaluated on the VoxCeleb1 test set.
The input features were 80-dimensional mel-filter banks computed with a 25 ms window and a 10 ms shift, and the channel dimension C of the model was 512. To make the input length divisible by the window size, we dropped the trailing frames. Data augmentation followed the SpeechBrain VoxCeleb recipe during training, in combination with the publicly available RIR dataset provided in [29]. Finally, we applied SpecAugment [30], which randomly masked 0 to 5 frames in the time domain and 0 to 10 channels in the frequency domain.
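For illustration, a minimal sketch of the two feature-level steps described above, i.e., cropping the input so its length is divisible by the window size and applying SpecAugment-style time and frequency masking. The function names are ours, and this is not the SpeechBrain recipe itself.

```python
import torch

def crop_to_window(feats, w=25):
    """Drop trailing frames so the number of frames is divisible by the window size w."""
    T = feats.shape[-1]
    return feats[..., : (T // w) * w]

def spec_augment(feats, max_time_mask=5, max_freq_mask=10):
    """SpecAugment-style masking as described above: zero out a random block of up to
    5 frames in time and up to 10 mel channels in frequency.
    feats is an (n_mels, T) filter-bank tensor."""
    n_mels, T = feats.shape
    out = feats.clone()
    t = int(torch.randint(0, max_time_mask + 1, (1,)))
    if t > 0:
        t0 = int(torch.randint(0, T - t + 1, (1,)))
        out[:, t0:t0 + t] = 0.0
    f = int(torch.randint(0, max_freq_mask + 1, (1,)))
    if f > 0:
        f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
        out[f0:f0 + f, :] = 0.0
    return out
```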

4.2. Experiment Setup

The AdamW optimizer was used with a weight decay of 0.1. We used a mini-batch size of 64 and an initial learning rate of 5 × 10−4. All models were trained with the CyclicLR scheduler and a minimum learning rate of 1 × 10−5, with the step size of one cycle set to 80 k iterations. All models were trained with AAM-softmax [31,32], with a margin of 0.2 and a softmax prescaling of 30, on the VoxCeleb2 dev set. To make the size of the feature maps in the Transformer block divisible by w, we chose the window size w from 20, 25, and 30 frames, following the evidence in [4]. Each window size was evaluated on the VoxCeleb1 test set, and the results are analyzed in the results section.
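The training setup can be summarized in the following sketch. The mapping of the 80 k-iteration step size to CyclicLR's step_size_up, the 5994-class output (the number of speakers in the VoxCeleb2 dev set), and the AAM-softmax implementation are assumptions consistent with, but not identical to, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """AAM-softmax (ArcFace-style) head [31,32]: cosine logits with an
    additive angular margin on the target class, scaled before softmax."""
    def __init__(self, emb_dim, n_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        cos = F.normalize(emb) @ F.normalize(self.weight).t()
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

model = SpeakerTransformer()                          # Section 3.1 sketch
criterion = AAMSoftmax(emb_dim=192, n_classes=5994)   # speakers in VoxCeleb2 dev
params = list(model.parameters()) + list(criterion.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-5,            # minimum learning rate
    max_lr=5e-4,             # initial / peak learning rate
    step_size_up=80_000,     # 80 k iterations per half cycle (assumed mapping)
    cycle_momentum=False)    # required for Adam-family optimizers
# mini-batches of 64 utterances; scheduler.step() is called once per iteration
```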

4.3. System Evaluation

We adopted the standard equal error rate (EER) and the minimum normalized detection cost (MinDCF) as evaluation metrics to compare our proposed system with previous work. For MinDCF, we assumed P_target = 10^{-2} and C_FA = C_Miss = 1. The EER is the operating point at which the false acceptance rate equals the false rejection rate. MinDCF takes into account the different costs of false rejections and false acceptances, as well as the prior probabilities of true speakers and impostors. We also show the DET curve of our proposed method. All our proposed models use a cosine similarity classifier as the backend. We analyze the proposed model architecture with a concise ablation study in the next section.
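For reference, a straightforward sketch of how EER and MinDCF can be computed from cosine-similarity trial scores under these settings; it is not the official scoring tool.

```python
import numpy as np

def eer_and_mindcf(scores, labels, p_target=0.01, c_fa=1.0, c_miss=1.0):
    """Compute EER and MinDCF from trial scores.
    scores: array of trial scores; labels: 1 for target trials, 0 for impostors."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]          # sweep the threshold from high to low
    labels = labels[order]
    n_tar, n_imp = labels.sum(), (1 - labels).sum()
    fa = np.cumsum(1 - labels) / n_imp        # false acceptance rate per threshold
    miss = 1.0 - np.cumsum(labels) / n_tar    # false rejection (miss) rate per threshold
    eer_idx = np.argmin(np.abs(fa - miss))
    eer = (fa[eer_idx] + miss[eer_idx]) / 2
    dcf = c_miss * miss * p_target + c_fa * fa * (1 - p_target)
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
    return eer, min_dcf
```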

4.4. Results

In this section, we compare the proposed method with several strong systems from previous studies and analyze the results. An overview of system performance is shown in Table 1, including VGG [26], TDNN [8,9,14], ResNet [6], Transformer [11,12,20], and our proposed architecture. According to the results on the VoxCeleb1 test set shown in Table 1, with models trained on the VoxCeleb2 dev set, our model achieved better EER and MinDCF than the baseline systems using VGG, TDNN, ResNet, and Transformer feature extractors. This shows that enhancing the locality of the Transformer can effectively improve performance. The DET curve of our proposed method is shown in Figure 3.
When performing local self-attention, we used fixed-size and non-overlapping sliding windows to divide the input features into several equal-length segments. To study the effect of different window sizes on model performance, we set the experimental window size according to the experience of previous work, such as [4]. Table 2 presents the results at different window sizes. Among the experimentally set window sizes, the best results were obtained in the EER and MinDCF indicators when the window size was 25 frames. Performance dropped when the window size became larger (30 frames) or smaller (20 frames). This shows that a reasonable window size improves performance, and when the window size is too large or too small, the model may not be able to capture the speaker information in the current segment well, resulting in performance degradation.
We note that our model outperformed all of the systems used for comparison. To investigate the impact of each part of the model, we performed an ablation study on the architecture introduced in Section 3. The results of the ablation experiments are given in Table 3.
In experiment (a), we replaced the proposed attention with the original full self-attention and kept everything else the same. The results showed that our method outperformed the original full self-attention, yielding relative improvements of 6.8% in EER and 4.1% in MinDCF. This suggests that enhancing the locality of the self-attention mechanism in the Transformer improves performance on the speaker verification task. Experiment (b) clearly demonstrates the importance of the sub-block aggregation net described in Section 3: aggregating different levels of features through the sub-block aggregation net leads to relative improvements of 23% in EER and 17.8% in MinDCF. This indicates that aggregating features at different levels enables the model to obtain richer information, which helps produce more robust speaker embeddings. In experiment (c), we removed LePE and kept the other configurations the same. The results showed that introducing LePE brings relative improvements of 3.5% in EER and 11.5% in MinDCF, suggesting that further enhancing the locality of the model through positional encoding can effectively improve performance.

5. Conclusions

In this work, we proposed a Transformer-based speaker embedding extractor for speaker verification with a novel global–local self-attention mechanism. The method balances the ability to model long-distance dependencies and the ability to capture local features by performing local and global attention in parallel. We aggregated features at different levels in the encoder and decoder to obtain more powerful speaker embeddings. The combination of these designs enables our proposed method to achieve excellent results compared with several strong baselines. Fine-tuning the hyperparameters and more thorough training may further improve the results. In future work, we will further improve performance by combining this method with other techniques, such as pre-training, while exploring how to better apply the Transformer to speaker recognition tasks.

Author Contributions

Methodology, F.X. and D.Z.; software, F.X.; validation, D.Z. and C.L.; writing—original draft preparation, F.X.; writing—review and editing, C.L.; supervision, D.Z.; project administration, D.Z.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2018******4402, 2020YFB1712401).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  2. India, M.; Safari, P.; Hernando, J. Double multi-head attention for speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6144–6148. [Google Scholar]
  3. Zhu, H.; Lee, K.A.; Li, H. Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. In Proceedings of the Conference of the International Speech Communication Association, Brno, Czech Republic, 30 August–3 September 2021; pp. 106–110. [Google Scholar]
  4. Shi, Y.; Chen, M.; Huang, Q.; Hain, T. T-vectors: Weakly supervised speaker identification using hierarchical transformer model. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6732–6736. [Google Scholar]
  5. Cai, W.; Chen, J.; Li, M. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. In Proceedings of the Odyssey 2018 The Speaker and Language Recognition Workshop, d’Olonne, France, 26–29 June 2018; pp. 74–81. [Google Scholar]
  6. Xie, W.; Nagrani, A.; Chung, J.S.; Zisserman, A. Utterance-level aggregation for speaker recognition in the wild. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 5791–5795. [Google Scholar]
  7. Chung, J.S.; Huh, J.; Mun, S. Delving into VoxCeleb: Environment Invariant Speaker Recognition. In Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 2–5 November 2020; pp. 349–356. [Google Scholar]
  8. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  9. Okabe, K.; Koshinaka, T.; Shinoda, K. Attentive Statistics Pooling for Deep Speaker Embedding. In Proceedings of the Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 2252–2256. [Google Scholar]
  10. El-Moneim, S.A.; Nassar, M.A.; Dessouky, M.I.; Ismail, N.A.; El-Fishawy, A.S.; El-Samie, F.E.A. Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed. Tools Appl. 2020, 79, 24013–24028. [Google Scholar] [CrossRef]
  11. Mary, N.J.M.S.; Umesh, S.; Katta, S.V. S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder. TASLP 2022, 30, 404–413. [Google Scholar] [CrossRef]
  12. Wang, R.; Ao, J.; Zhou, L.; Liu, S.; Wei, Z.; Ko, T.; Li, Q.; Zhang, Y. Multi-View Self-Attention Based Transformer for Speaker Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore, 23–27 May 2022; pp. 6732–6736. [Google Scholar]
  13. Zhu, Y.; Ko, T.; Snyder, D.; Mak, B.; Povey, D. Self-attentive speaker embeddings for text-independent speaker verification. In Proceedings of the Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 3573–3577. [Google Scholar]
  14. Wu, Y.; Guo, C.; Gao, H.; Hou, X.; Xu, J. Vector-Based Attentive Pooling for Text-Independent Speaker Verification. In Proceedings of the Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 936–940. [Google Scholar]
  15. Wang, Z.; Yao, K.; Li, X.; Fang, S. Multi-resolution multi-head attention in deep speaker embedding. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6464–6468. [Google Scholar]
  16. Yu, Y.-Q.; Li, W.-J. Densely Connected Time Delay Neural Network for Speaker Verification. In Proceedings of the Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 921–925. [Google Scholar]
  17. Jiang, Y.; Song, Y.; McLoughlin, I.; Gao, Z.; Dai, L.-R. An Effective Deep Embedding Learning Architecture for Speaker Verification. In Proceedings of the Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 4040–4044. [Google Scholar]
  18. Shi, Y.; Huang, Q.; Hain, T. Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification. In Proceedings of the Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 2992–2996. [Google Scholar]
  19. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
  20. Safari, P.; Massana, M.À.I.; Pericás, F.J.H. Self-attention encoding and pooling for speaker recognition. In Proceedings of the Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 941–945. [Google Scholar]
  21. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  22. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. arXiv 2021, arXiv:2107.00652. [Google Scholar]
  23. Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite Transformer with Long-Short Range Attention. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  24. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  25. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  26. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 2616–2620. [Google Scholar]
  27. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 1086–1090. [Google Scholar]
  28. Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624. [Google Scholar]
  29. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
  30. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar]
  31. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  32. Xiang, X.; Wang, S.; Huang, H.; Qian, Y.; Yu, K. Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Lanzhou, China, 18–21 November 2019; pp. 1652–1656. [Google Scholar]
Figure 1. The overall architecture of the proposed Transformer.
Figure 2. (a) The illustration of our proposed Transformer block; (b) the illustration of the global–local self-attention.
Figure 3. The DET curve of our proposed method.
Table 1. Performance comparison with the VoxCeleb1 test set.
Architecture          Extractor            VoxCeleb2 Dev
                                           EER (%)      MinDCF
MHA [26]              VGG                  3.19         0.27
X-vector [8]          TDNN                 3.13         0.33
Atten. Stats. [9]     TDNN                 2.59 [14]    0.29
Xie et al. [6]        ResNet               3.22         -
SAEP [20]             Transformer          5.44         -
S-vectors 1 [11]      Transformer          2.67         0.30
MV [12]               CNN + Transformer    2.56         -
Our work              CNN + Transformer    2.44         0.23
1 Training dataset is VoxCeleb1 + 2.
Table 2. Results with the window size w varied from 20 to 30 frames.
Architecture    Window Size    VoxCeleb2 Dev
                               EER (%)    MinDCF
Our work        20             2.50       0.25
                25             2.44       0.23
                30             2.52       0.27
Table 3. Ablation study of our proposed architecture.
Architecture                          VoxCeleb2 Dev
                                      EER (%)    MinDCF
Our work                              2.44       0.23
(a) Original full self-attention      2.62       0.24
(b) No sub-block aggregation net      3.17       0.28
(c) No LePE                           2.53       0.26
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
