Article

Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding

by
Minsoo Kim
and
Gil-Jin Jang
*
School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8138; https://doi.org/10.3390/app14188138
Submission received: 12 July 2024 / Revised: 2 September 2024 / Accepted: 6 September 2024 / Published: 10 September 2024
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

Featured Application

Speech recognition; speaker adaptation; speaker diarization.

Abstract

Automatic speech recognition (ASR) aims at understanding naturally spoken human speech so that it can be used as text input to machines. In multi-speaker environments, where multiple speakers talk simultaneously with a large amount of overlap, conventional ASR systems trained on recordings of single talkers may suffer a significant performance degradation. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding information for each speaker in the training set was extracted as a numeric vector, and all of the embedding vectors were stacked to construct a total speaker profile matrix. The speaker profile matrix built from the training dataset makes it possible to find embedding vectors that are close to those of the speakers in the input recordings under test conditions, and it helps to recognize the individual speakers’ voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder of an existing speaker-independent ASR model, eliminating the need to retrain the entire system. Various speaker embedding methods such as the i-vector, d-vector, and x-vector were adopted, and the experimental results show 0.33% and 0.95% absolute (3.9% and 11.5% relative) improvements in word error rate (WER) without and with the speaker profile, respectively.

1. Introduction

Automatic speech recognition (ASR) aims at understanding naturally spoken human speech so that it can be used as text input to machines, which can then understand the human speaker’s intent and further communicate with the talker. Natural speaking environments include speaking in the presence of ambient sounds or other non-target human speakers. In multi-speaker environments, where multiple speakers are talking simultaneously with a large amount of overlap, a significant performance degradation may occur with conventional ASR systems if they are trained on recordings of single talkers. This is called the cocktail party problem [1], and it occurs when the speech of multiple speakers mixes together, making it intractable for traditional single-speaker speech recognition models to understand what each person is independently saying. As an example, a dual-speaker condition is shown in Figure 1. Two speakers are talking at the same time, their voices are captured in a single recording, denoted by x, and the multi-speaker ASR system is expected to generate the original texts of what the two independent speakers are saying, denoted by y^(1) and y^(2). We use parenthesized and superscripted numbers for the speaker index on the variables, which should not be confused with exponential notation. In Figure 1, the overlapped speech x in the middle is quite different from the first speaker’s voice (top left), as well as the second speaker’s (top right). Recovering the original texts, such as “Hello, this is me.” and “She goes to school.”, by listening to the mixed sound x only is very hard even for human listeners, because adding two sounds may create unexpected artifacts and obscure the original acoustic characteristics. However, these multi-speaker conditions are quite common, such as in ordinary conversations or when speaking with background human speakers. Therefore, multi-speaker speech recognition is required to build ASR systems that are usable in any circumstance. Conventional noise suppression techniques may not be applicable because they cannot separate sources with the same characteristics.
Many approaches have been proposed to address the challenge of multi-speaker ASR. Techniques such as deep clustering (DPCL) [2] and permutation invariant training (PIT) [3] have been proposed; they are combined with a single-speaker ASR model by passing their explicit separation outputs as inputs to the ASR model. These methods use either the hybrid ASR framework [4] or the end-to-end ASR framework [5,6]. Other approaches to multi-speaker ASR utilize speaker embeddings. Various models use speaker embeddings as an additional input [7] or jointly model speaker recognition together with ASR to create a multi-speaker ASR solution; well-known methods include speaker-attributed training [8,9]. Speaker embedding techniques such as x-vectors have been explored to improve the separation and recognition of overlapped speech. However, these methods encounter challenges, including performance degradation due to speech signal distortion, high computational complexity, and the need for extensive training data. The persistent challenge of recognizing individual speakers’ speech in overlapping recordings highlights the need for more robust multi-speaker ASR solutions.
To solve the aforementioned challenges, we propose a novel speaker-attributed training (SAT) method. The proposed method incorporates speaker embedding information as an additional input. The embedding information for each of the speakers in the training set is extracted as a numeric vector, and all of the embedding vectors are stacked to construct a speaker profile matrix. The profile matrix from the training dataset enables finding the similarities of the embedding vectors to those of the speakers of the input recordings in test conditions, and it helps to separately recognize the multiple speakers’ voices mixed in the input. Various speaker embedding vectors, such as i-vector [10], d-vector [11], and x-vector [12], are adopted. Furthermore, the proposed method efficiently reuses the decoder from the conventional SAT-ASR (speaker-attributed training for automatic speech recognition) [6] to eliminate the need for fully training the entire model.
Experimental results show that the proposed method improves existing multi-speaker speech recognition models based on the Transformer network [6] in terms of word error rate (WER) on the LibriMix dataset [13]. The proposed SAT-ASR model demonstrates improved WERs in multi-speaker speech environments. Without a speaker profile matrix, we observed a 0.33% absolute improvement (3.9% relative improvement) in WER, and with a speaker profile matrix, a 0.95% absolute improvement (11.5% relative) was observed. Section 2 presents related work on speaker embeddings and multi-speaker speech recognition models. Section 3 introduces the proposed SAT multi-speaker ASR model and speaker profile. Section 4 evaluates the performance of this model. Finally, Section 5 concludes with a summary.

2. Related Work

This section describes the conventional methods to solve multi-speaker speech recognition problems. There are many abbreviations used in this manuscript, and all of them are listed at the end of Section 5 in the order of appearance.

2.1. Speaker Embedding

Speaker embedding techniques are used to extract the characteristics of speakers from speech signals, and they have proven valuable in various tasks such as speaker verification [14], speaker diarization [15], speech synthesis [16], and speaker separation [17]. Similar to Word2Vec [18] and Wav2Vec [19,20], speaker embedding techniques generate fixed-dimensional numeric vectors from speech signals whose lengths vary over time. One of the most well-known speaker embedding vectors is the intermediate vector (i-vector) [10]. The i-vector is obtained by projecting a speaker’s universal background model (UBM) representation onto characteristic coordinate vectors, which are trained to maximize the likelihood of the projection using unsupervised learning techniques. This projection maps the high-dimensional UBM representation to a representation of the desired dimension: the i-vector. The i-vectors are applied to speaker verification, facilitating decisions on whether speakers are the same or different [21] using a probabilistic linear discriminant analysis (PLDA) [22] classifier. The x-vector [12] utilizes deep neural networks (DNN) to extract speaker characteristics from speech. Another embedding method uses a three-layer long short-term memory (LSTM) network with a projection layer, trained with a generalized end-to-end loss function [11]. The resultant embedding vectors are called d-vectors, and they have demonstrated particularly strong performance in text-dependent speaker recognition tasks on datasets with short utterances.
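As a concrete illustration of the d-vector style of extraction, the following PyTorch sketch maps a variable-length sequence of 40-dimensional log-mel frames to a single L2-normalized embedding with a three-layer LSTM. The layer sizes, the use of the last frame's hidden state, and the class name are illustrative assumptions rather than the exact architecture of [11], which is additionally trained with the generalized end-to-end loss.

```python
import torch
import torch.nn as nn

class DVectorExtractor(nn.Module):
    """Minimal d-vector-style speaker embedding extractor (illustrative sizes)."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_layers=3, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)             # frame-level hidden states
        emb = self.proj(out[:, -1, :])        # use the last frame's state as the utterance summary
        return emb / emb.norm(dim=-1, keepdim=True)   # L2-normalize the embedding

# Usage: a 3-second utterance at a 10 ms frame shift is roughly 300 frames
feats = torch.randn(1, 300, 40)
d_vector = DVectorExtractor()(feats)          # shape: (1, 256)
```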

2.2. Conventional Multi-Speaker Speech Recognition Methods

Several methods have been proposed to generate multiple speaker-specific texts from mixed recordings of multiple talkers with overlapping speech. The conventional methods can be classified into the following four types.
(a)
A combination of single-speaker ASR models and existing acoustic source separation methods, such as deep clustering (DPCL) [2], permutation invariant training (PIT) [3,23], using x-vectors [24], self-supervised learning [25], and the joint learning of separation masks and speech recognition models [26].
(b)
Directly building a multi-speaker speech recognition system using mixed speech as the input and multiple text streams as learning targets [4,27,28,29].
(c)
The addition of speaker embedding vectors as indicators for the speakers in the input mixture [7,8,9,30], combined with the multi-speaker ASR system in (b). This is called speaker-attributed training for automatic speech recognition (SAT-ASR) [8].
(d)
Addition of a splitting encoder network that generates speaker-specific speech embedding vectors, using them as inputs to the single-speaker ASR decoder [5,6,31]. No retraining of a single-speaker ASR decoder is necessary.
As illustrated in Figure 2, Type (a) is implemented by combining existing source separation and ASR models, and no retraining or fine tuning is required. Its limitation is that the ASR performance degrades according to the distortion in the separated speech signals caused by the adopted monaural source separation method, and it is no better than that of the other types, which use fine tuning [4,5]. The second type constructs a single-input and multiple-output (SIMO) black box, with its inputs and outputs being mixed speech signals and multiple word sequences spoken by the individual speakers, respectively. This SIMO black box is usually implemented by either hybrid deep neural networks [4] or end-to-end ASR [5,6]. This method ensures decent multi-speaker ASR performance if a sufficiently large amount of training data is given. However, each of the training samples should be mixed with all the other speakers’ samples to simulate all the possible combinations of talkers; thus, the training time may become O(MN), where M is the number of speakers and N is the number of samples, while it is O(N) for single-speaker ASR models. The third type, shown in Figure 2-(c), is called speaker-attributed ASR [8] and appends speaker embedding vectors to the input speech features. The additional embedding vectors inform the ASR of the characteristics of the overlapped speakers. This is expected to help the multi-speaker ASR model separate the speakers during training, resulting in faster convergence as well as improved performance [8,9]. It requires additional training of the speaker encoder with mixed-input recordings, so the computational overhead of the whole training is of the same order as Type (b), O(MN).
The last method, Type (d), was proposed to overcome the heavy-computation problem. The single-speaker ASR is constructed by a Transformer network [6], and its decoder is used without retraining. The “Split Encoder” shown in Figure 2-(d) was trained by a set of mixed-speech signals in such a way that it can encode the mixed input into multiple representations, H^(1) and H^(2), that are compatible with the ASR decoder. The multiple text outputs, denoted by y^(1) and y^(2), where the parenthesized superscripts are speaker indices, were used as target labels, and only the split encoder was fully trained while the single-input and single-output (SISO) ASR decoder was fixed. Training the ASR decoder takes O(N) time and training the split encoder takes O(MN), but the split encoder is much smaller than the ASR decoder, so its training takes much less time. Therefore, the computational overhead was greatly reduced compared to the SIMO ASR approach.
This paper proposes a novel multi-speaker ASR method that takes advantage of Types (c) and (d): a combination of speaker-attributed training and encoder–decoder-based multi-speaker ASR. Speaker embedding vectors from the speaker encoder trained on the single-talker data in (c) were concatenated to the input of the split encoder of (d) to help split the mixed speech and drive the encoder outputs closer to the single-speaker representations. The proposed method is computationally efficient, and permutation invariant training was not required because the speaker embedding vectors prevent permutation ambiguity. The implementation details of the proposed method are described in Section 3.

3. Elaborated Model of the Proposed Methods

3.1. Proposed Speaker-Attributed Training

The proposed model combines automatic speech recognition (ASR) using the encoder–decoder architecture of the Transformer network [6] and speaker-attributed training with speaker embedding vectors [7,8]. As shown in Figure 3, the speaker encoder and decoder blocks of a Transformer network, Enc_speaker and Dec_speaker, were trained using the ASR training dataset. The subscript ‘speaker’ indicates that the encoder–decoder extracts speaker embedding vectors from the input recording. The training samples were single-talker utterances, and their targets were speaker identity labels, so that the obtained representations maximally emphasize the differences between speakers. The speaker embedding block consists of the trained Enc_speaker and Dec_speaker, and it is represented as
$$X = \{x_1, x_2, \ldots, x_T\}, \tag{1}$$
$$E = \{e_1, e_2, \ldots, e_T\} = \mathrm{Enc}_{\mathrm{speaker}}(X), \tag{2}$$
$$Q_{\mathrm{mix}} = \{q_1, q_2, \ldots, q_T\} = \mathrm{Dec}_{\mathrm{speaker}}(E), \tag{3}$$
where T is the number of feature vectors from the time-varying input recording, x_t is the extracted feature at analysis frame index t, e_t is the encoder output, and q_t is the extracted speaker embedding vector at index t. The embedding vector q_t is also referred to as a speaker query vector [8,32] because it identifies the speaker in the input speech. The capital notations, such as X, E, and Q_mix, represent the sets of all T vectors, and they are often implemented as matrices. We add ‘mix’ as a subscript to Q to indicate that this set of speaker embedding vectors is used by the ASR encoder Enc_mix.
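The following PyTorch sketch traces Equations (1)–(3): a pretrained speaker encoder–decoder maps the frame-level features X to per-frame speaker query vectors Q_mix. The placeholder layer types and sizes (a small Transformer encoder and a linear projection standing in for Enc_speaker and Dec_speaker) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Sketch of Equations (1)-(3); layer choices and sizes are illustrative placeholders.
T, feat_dim, emb_dim = 300, 80, 256
enc_speaker = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
    num_layers=2)                                # stands in for Enc_speaker
dec_speaker = nn.Linear(feat_dim, emb_dim)       # stands in for Dec_speaker

X = torch.randn(1, T, feat_dim)                  # acoustic features x_1..x_T, Eq. (1)
E = enc_speaker(X)                               # encoder outputs e_1..e_T, Eq. (2)
Q_mix = dec_speaker(E)                           # speaker query vectors q_1..q_T, Eq. (3)
print(Q_mix.shape)                               # (1, 300, 256): one embedding per frame
```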
In the case of the d-vector method [33,34], the extraction is autoregressive, so the embedding vector at each time step depends only on the previous vectors, which can be expressed as
$$q_t = \mathrm{Dec}_{\mathrm{speaker}}^{(\mathrm{dvec})}\left(q_1, \ldots, q_{t-1}, e_t\right), \quad t = 1, \ldots, T. \tag{4}$$
As shown in Figure 3, the ASR encoder was divided into three encoders with different roles. The acoustic features X and the speaker embeddings Q_mix were concatenated, which is called vertical stacking [7]. The first encoder Enc_mix maps the concatenated input to a higher-dimensional representation H_mix,
$$H_{\mathrm{mix}} = \mathrm{Enc}_{\mathrm{mix}}\left(\left[X, Q_{\mathrm{mix}}\right]\right). \tag{5}$$
The role of Enc_mix is to emphasize the speaker information by merging the speaker embeddings Q_mix with X. From the higher-dimensional representation H_mix, the speaker-differentiating encoder Enc_diff^(s) extracts the speech of each speaker,
$$H^{(s)} = \mathrm{Enc}_{\mathrm{diff}}^{(s)}(H_{\mathrm{mix}}), \quad s = 1, \ldots, S, \tag{6}$$
where s and S represent the speaker index and the total number of speakers, respectively; in Figure 3, S = 2. Enc_diff^(s) splits the mixed features of the S speakers so that only single-speaker information remains in each H^(s). The last encoder, Enc_rec, transforms each H^(s) into the embedding G^(s) to be compatible with the pretrained decoder Dec_rec. The encoder weights are shared over all the inputs H^(s). Finally, Dec_rec maps each G^(s) to the output y^(s), which is typically a word sequence for ASR. The pretrained decoder Dec_rec is fixed and shared over all of the inputs.
$$G^{(s)} = \mathrm{Enc}_{\mathrm{rec}}(H^{(s)}), \tag{7}$$
$$y^{(s)} = \mathrm{Dec}_{\mathrm{rec}}(G^{(s)}). \tag{8}$$
The three-stage encoder design has several advantages. The recognition encoder, Enc_rec, is focused on matching the previous encoder output to the recognition decoder Dec_rec, and its weights are shared across the speaker branches so that the number of trainable parameters is reduced. The speaker splitting is performed by Enc_diff, and merging the speaker embedding vectors and the speech features is performed by Enc_mix only. Assigning different roles to the encoders provides faster training and improved performance. Another advantage is that the pretrained ASR decoder and speaker encoder–decoder are used without retraining. Moreover, those modules are trained on single-talker recordings, and only the encoders require multi-talker recordings for fine tuning, as shown by the dotted box in Figure 3.
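A minimal PyTorch sketch of the three-stage encoder in Equations (5)–(8) is given below, assuming S = 2 speakers, a 256-dimensional model, and a simple linear-plus-Transformer stand-in for Enc_mix; the layer counts and sizes are illustrative, and the frozen Dec_rec is omitted.

```python
import torch
import torch.nn as nn

class MultiStageEncoders(nn.Module):
    """Sketch of Enc_mix, Enc_diff^(s), and Enc_rec in Equations (5)-(8); sizes are illustrative."""
    def __init__(self, in_dim, d_model=256, num_speakers=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=2048, batch_first=True)
        # Enc_mix: merges acoustic features and speaker embeddings (linear + Transformer stand-in)
        self.enc_mix = nn.Sequential(nn.Linear(in_dim, d_model),
                                     nn.TransformerEncoder(make_layer(), num_layers=2))
        # Enc_diff^(1..S): one speaker-differentiating encoder per speaker, Eq. (6)
        self.enc_diff = nn.ModuleList(
            [nn.TransformerEncoder(make_layer(), num_layers=4) for _ in range(num_speakers)])
        # Enc_rec: a single recognition encoder shared over all speakers, Eq. (7)
        self.enc_rec = nn.TransformerEncoder(make_layer(), num_layers=8)

    def forward(self, x_cat):                        # x_cat = [X, Q_mix] from Eq. (5)
        h_mix = self.enc_mix(x_cat)                  # H_mix
        return [self.enc_rec(enc_d(h_mix)) for enc_d in self.enc_diff]   # G^(1..S)

encoders = MultiStageEncoders(in_dim=80 + 256)       # acoustic + speaker-embedding dimensions
G = encoders(torch.randn(1, 300, 80 + 256))          # two decoder-compatible streams
print([g.shape for g in G])                          # [(1, 300, 256), (1, 300, 256)]
```

Only the list of speaker-differentiating encoders grows with the number of speakers; Enc_mix and Enc_rec are instantiated once, mirroring the argument above.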
The whole procedure of the proposed speaker-attributed training is shown in Algorithms 1–3. Procedure Forward in Algorithm 1 is a step-by-step illustration of the forward inference in Figure 3. The inputs are a set of mixed feature vectors X as in Equation (1) and the parameter sets Θ_0 and Θ_1, and the output is the set of predicted word sequences Ŷ for the given input X. Θ_0 consists of the speaker encoder and decoder used to extract the speaker embedding vectors, i.e., Enc_speaker and Dec_speaker, and the pretrained speech recognition decoder Dec_rec. Θ_1 is the set of Enc_mix, Enc_diff^(1…S), and Enc_rec, which are required by Equations (5)–(7). Procedure Backward in Algorithm 2 requires the predicted and target word sequences Ŷ and Y. It first computes the loss between Ŷ and Y, and then it performs a single backpropagation learning step to minimize the loss (denoted by L).
Procedure SAT in Algorithm 3 defines the whole training process. It requires as input the mixed features X, the target word sequences Y, and the parameter set Θ_0, which are fixed throughout the entire training process. The output of Procedure SAT, Θ_1, is a set consisting of the mixed-input encoder (Enc_mix), the speaker-differentiating encoders (Enc_diff), and the recognition encoder (Enc_rec). In Line 2 of Algorithm 3, they are created by an appropriate initialization method. In Lines 4–5, the input features X are passed through the network constructed by the parameters Θ_0 and Θ_1 by calling the sub-procedure Forward, and the set Θ_1 is updated by calling Backward. These steps are repeated until the training loss L converges. The detailed implementation is publicly available at https://github.com/craft8244/MSSR, accessed on 17 August 2024. Although Figure 3 shows the architecture for the two-speaker mixing case, the proposed method can be applied to cases with more than two speakers by replicating only Enc_diff. For the other encoders, Enc_mix and Enc_rec, only a single instance is required, so there is no architectural change according to the number of speakers. Once adequate training samples with the desired number of speakers are given, the proposed method can be easily applied.
Algorithm 1 Speaker-attributed forward inference
Require: X = x_1 ⋯ x_T, Θ_0, Θ_1    ▹ overlapped speech and transformer encoder/decoders
1: procedure Forward(X, Θ_0, Θ_1):
2:     Enc_speaker, Dec_speaker, Dec_rec ← Θ_0
3:     Enc_mix, Enc_diff^(1…S), Enc_rec ← Θ_1    ▹ fetch transformer parameters
4:     E ← Enc_speaker(X)    ▹ generate speaker encoder output
5:     Q_mix ← Dec_speaker(E)    ▹ convert to a set of multi-speaker embedding vectors
6:     (X, Q_mix)    ▹ concatenate embedding and overlapped speech
7:     H_mix ← Enc_mix([X, Q_mix])    ▹ map to high-dimensional features
8:     for s = 1, …, S do:
9:         H^(s) ← Enc_diff^(s)(H_mix)    ▹ extract single-speaker features
10:        G^(s) ← Enc_rec(H^(s))    ▹ convert to decoder-compatible features
11:        ŷ^(s) ← Dec_rec(G^(s))    ▹ generate speech recognition outputs
12:    end for
13:    return Ŷ = {ŷ^(1), …, ŷ^(S)}
14: end procedure
Algorithm 2 Backward learning
Require: Ŷ, Y, Θ_1    ▹ predicted and target word sequences, and parameters to update
1: procedure Backward(Ŷ, Y, Θ_1):
2:     L ← LOSS(Ŷ, Y)    ▹ compute the loss between estimated and true word sequences
3:     Θ_1 ← Backprop(Θ_1, L)    ▹ perform one backpropagation step to update the parameters minimizing L
4:     return L, Θ_1
5: end procedure
Algorithm 3 Speaker-attributed training (SAT)
Require: X = x_1 ⋯ x_T, Y = {y^(1), …, y^(S)}, Θ_0 = {Enc_speaker, Dec_speaker, Dec_rec}    ▹ overlapped speech, multiple target word sequences, pretrained transformer encoder/decoders
1: procedure SAT(X, Y, Θ_0):
2:     Θ_1 ← INIT(Enc_mix, Enc_diff^(1…S), Enc_rec)    ▹ initialize multi-speaker encoders
3:     repeat
4:         Ŷ ← Forward(X, Θ_0, Θ_1)
5:         L, Θ_1 ← Backward(Ŷ, Y, Θ_1)
6:     until L converges
7:     return Θ_1
8: end procedure
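For readers who prefer code over pseudocode, the following Python sketch mirrors Algorithms 1–3 under the assumption that the modules of Figure 3 are available as PyTorch callables (enc_speaker, dec_speaker, dec_rec pretrained and frozen; enc_mix, enc_diff, enc_rec trainable). The function names, the fixed step count standing in for "until L converges", and the generic loss_fn are illustrative assumptions.

```python
import torch

def forward(X, theta0, theta1):
    """Algorithm 1: forward inference through the frozen and trainable modules."""
    enc_speaker, dec_speaker, dec_rec = theta0
    enc_mix, enc_diff, enc_rec = theta1              # enc_diff: one encoder per speaker
    Q_mix = dec_speaker(enc_speaker(X))              # speaker embedding vectors (Eqs. 1-3)
    H_mix = enc_mix(torch.cat([X, Q_mix], dim=-1))   # concatenate and encode (Eq. 5)
    return [dec_rec(enc_rec(ed(H_mix))) for ed in enc_diff]   # predicted outputs y^(1..S)

def sat_train(X, Y, theta0, theta1, optimizer, loss_fn, num_steps=100):
    """Algorithms 2-3: repeat forward inference and one backpropagation step per iteration."""
    for _ in range(num_steps):                       # a fixed step count stands in for convergence
        Y_hat = forward(X, theta0, theta1)
        loss = sum(loss_fn(y_hat, y) for y_hat, y in zip(Y_hat, Y))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                             # updates only the parameters held by the optimizer
    return theta1
```

In practice, the optimizer would be constructed over the parameters of enc_mix, enc_diff, and enc_rec only, which reproduces the frozen Θ_0 of Procedure SAT.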

3.2. Speaker-Attributed Training Using Speaker Profiles

Because the speaker embedding vectors are provided as the input to Enc_mix, they should be as relevant to the speakers as possible in order to obtain optimal splitting performance. However, because Enc_speaker and Dec_speaker were trained on single-talker recordings, the extracted speaker embedding vectors may not reliably model the mixed speakers. Therefore, we adopted attention-weighted averaging of the speaker profiles [8,32,35] to better match the input to the speaker embedding vectors in the target profile.
By applying the speaker encoder–decoder to the training data, speaker embedding vectors were extracted and clustered into a predefined number of profile vectors [32], denoted as P = {p_1, p_2, …, p_K}, where K is the number of profile vectors. In this paper, P was implemented as a matrix in which each column represents a profile embedding vector, and the number of profiles used was 40. For an input query vector q_t, its closeness to profile k was computed by the cosine similarity between the two vectors as follows [32]:
$$b_{t,k} = \frac{q_t \cdot p_k}{\lVert q_t \rVert \, \lVert p_k \rVert}, \quad t = 1, 2, \ldots, T, \tag{9}$$
where the operator · denotes the inner product, t is the frame index over time (in our work, a vector was extracted every 10 milliseconds), k is the profile vector index in the profile matrix P, and T is the number of speaker embedding vectors. In our implementation, the inner products q_t · p_k were efficiently computed as Q^⊤P, where ⊤ denotes the matrix transpose, which generates all T × K inner products in a single step on a GPU. The cosine similarity b_{t,k} between embedding vector q_t and profile vector p_k ranges from −1 to 1, so simple linear weighting could vary widely. Therefore, b_{t,k} was scaled by a softmax normalization so that the resultant attention weight β_{t,k} ranges from 0 to 1:
$$\beta_{t,k} = \frac{\exp\left(b_{t,k}\right)}{\sum_{j=1}^{K} \exp\left(b_{t,j}\right)}. \tag{10}$$
The attention-weighted speaker profile q̃_t was calculated as a linear combination of the profile vectors weighted by β_{t,k} as
$$\tilde{q}_t = \sum_{k=1}^{K} \beta_{t,k}\, p_k, \tag{11}$$
$$\tilde{Q}_{\mathrm{mix}} = \left\{ \tilde{q}_1, \tilde{q}_2, \ldots, \tilde{q}_T \right\}. \tag{12}$$
The speaker embedding set Q_mix in Equation (5) was replaced with Q̃_mix from Equation (12). Once the profile matrix P is prepared, only the cosine similarity computation is additionally required, resulting in minimal computational overhead during the inference step. Figure 4 illustrates the speaker-attributed ASR using speaker profiles. The speaker profiling block was added to replace the speaker embedding vectors with the weighted averages of the profile vectors. Speaker profiling is expected to convert an unknown speaker embedding into a linear combination of known speaker embedding vectors; therefore, Enc_mix can better split the speaker information into H^(1) and H^(2).
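A minimal PyTorch sketch of Equations (9)–(12) is shown below; it assumes the profile matrix is stored with one profile vector per row, and the frame count, profile count, and embedding dimension are illustrative.

```python
import torch
import torch.nn.functional as F

T, K, emb_dim = 300, 40, 512                    # frames, profile vectors, embedding dimension
Q = torch.randn(T, emb_dim)                     # speaker query vectors q_1..q_T from Dec_speaker
P = torch.randn(K, emb_dim)                     # profile matrix, one profile vector per row

B = F.normalize(Q, dim=-1) @ F.normalize(P, dim=-1).T   # cosine similarities b_{t,k}, Eq. (9)
beta = torch.softmax(B, dim=-1)                 # attention weights beta_{t,k}, Eq. (10)
Q_tilde = beta @ P                              # attention-weighted embeddings, Eqs. (11)-(12)
print(Q_tilde.shape)                            # (300, 512): one regularized embedding per frame
```

All T × K similarities are obtained with a single matrix product, in line with the batched inner-product computation described above.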
Procedure Beta in Algorithm 4 is a step-by-step description of the attention-weighted speaker embedding vector derivation in Equations (9)–(11). The computed similarities represent the attention of q to the enrolled profile vectors. The procedure computes cosine similarities to all the embedding vectors in the profile matrix P, and it replaces the speaker embedding vector q with a linear combination of the profile embedding vectors. Procedure Forward in Algorithm 5 is the forward inference with a profile matrix. Most of the steps are identical to Algorithm 1, except that the speaker embeddings Q_mix of the mixed-speech features are replaced with the attention-weighted embedding vectors, as shown in Lines 6–9. There is no change in the backpropagation steps or training iterations, so Algorithms 2 and 3 can be used without any modification. The detailed implementation is also publicly available at https://github.com/craft8244/MSSR, accessed on 17 August 2024.
Algorithm 4 Attention-weighted speaker embedding vector extraction
Require: q, P    ▹ speaker embedding vector, speaker profile matrix
1: procedure Beta(q, P):
2:     K ← |P|    ▹ number of profile vectors
3:     for k = 1, …, K do:
4:         b_k ← q · p_k / (‖q‖ ‖p_k‖)    ▹ compute cosine similarity
5:     end for
6:     for k = 1, …, K do:
7:         β_k ← exp(b_k) / Σ_{j=1}^{K} exp(b_j)    ▹ softmax normalization
8:     end for
9:     return q̃ = Σ_{k=1}^{K} β_k p_k    ▹ return attention-weighted speaker embedding
10: end procedure
Algorithm 5 Forward inference using speaker profile
Require: X = x_1 ⋯ x_T, Θ_0, Θ_1, P    ▹ overlapped speech, transformer encoder/decoders, speaker profile matrix
1: procedure Forward(X, Θ_0, Θ_1, P):
2:     Enc_speaker, Dec_speaker, Dec_rec ← Θ_0
3:     Enc_mix, Enc_diff^(1…S), Enc_rec ← Θ_1    ▹ fetch transformer parameters
4:     E ← Enc_speaker(X)    ▹ generate a matrix of multi-speaker encoder outputs
5:     q_1 ⋯ q_T ← Dec_speaker(E)    ▹ convert to a matrix of embedding vectors
6:     for t = 1, …, T do:
7:         q̃_t ← Beta(q_t, P)    ▹ attention weighting
8:     end for
9:     Q̃_mix = q̃_1 ⋯ q̃_T    ▹ construct attention-weighted embedding matrix
10:    H_mix ← Enc_mix([X, Q̃_mix])    ▹ map to high-dimensional features
11:    for s = 1, …, S do:
12:        H^(s) ← Enc_diff^(s)(H_mix)    ▹ extract single-speaker features
13:        G^(s) ← Enc_rec(H^(s))    ▹ convert to decoder-compatible features
14:        y^(s) ← Dec_rec(G^(s))    ▹ generate speech recognition outputs
15:    end for
16:    return Ŷ = {ŷ^(1), …, ŷ^(S)}
17: end procedure

4. Experimental Performance Evaluation

The proposed method was evaluated by its speech recognition performance on the LibriMix dataset [13], where each audio sample contains only two speakers. Specifically, the speech recognition performance was evaluated according to the speaker embeddings used for the speaker profiles in the proposed method, as well as the number of utterances extracted for each profile. Furthermore, we compared the speech recognition performance across three different systems using speaker embeddings and speaker profiles: single-speaker ASR, existing multi-speaker ASR, and the proposed SAT-ASR methods.

4.1. Experimental Setup

The LibriMix dataset is an open-source, single-channel, multi-speaker speech dataset derived from LibriSpeech [36]. It is composed of mixtures involving two or three speakers, along with ambient noise samples from WHAM! [37]. The dataset is constructed by mixing single-speaker recordings from LibriSpeech with partial overlap and integrated noise. It is divided into two primary collections [13]: Libri2Mix, which features two speakers, and Libri3Mix, with three speakers. We carried out experiments on Libri2Mix only because two-speaker mixing is much more common in real environments. Each of these subsets includes versions with and without WHAM! noise, leading to a ‘clean’ set free from noise and a ‘both’ set containing noise. There are four versions of the dataset available, differentiated by two sampling rates (16 kHz and 8 kHz) and two modes (min and max). In min mode, the mixture concludes at the end of the shortest utterance, while in max mode, the shortest utterance is extended to match the length of the longest one. For our proposed model, which is designed for scenarios with two speakers, we utilized the Libri2Mix subset of LibriMix. We specifically chose the ‘clean’ dataset without noise at a 16 kHz sampling rate and in max mode to maximize the amount of speech data for training. Table 1 presents the statistics of the evaluation datasets. The training dataset consists of 64,700 utterances, amounting to approximately 270 h. The development dataset consists of 3000 utterances, totaling 11 h. Similarly, the test set includes 3000 utterances, also spanning 11 h, and is completely independent from the training and development sets. The dataset includes the following six different rates of speech overlap: 0%, 20%, 40%, 60%, 80%, and 100%. The maximum duration of each utterance is 15 s. The training, development, and test data contain 1172, 40, and 40 independent speakers, respectively.
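To make the overlap configuration concrete, the following NumPy sketch forms a two-speaker mixture in which a chosen fraction of the shorter utterance overlaps the longer one. It is only an illustration of partial overlap, not the actual LibriMix generation scripts, and the function name and padding scheme are assumptions.

```python
import numpy as np

def mix_two(utt_a, utt_b, overlap_ratio=0.4):
    """Mix two single-speaker waveforms so that overlap_ratio of the shorter one overlaps."""
    short, long_ = sorted([utt_a, utt_b], key=len)
    overlap = int(overlap_ratio * len(short))
    offset = max(len(long_) - overlap, 0)            # where the shorter utterance starts
    total = max(len(long_), offset + len(short))
    mix = np.zeros(total, dtype=np.float32)
    mix[:len(long_)] += long_
    mix[offset:offset + len(short)] += short
    return mix

a = np.random.randn(16000 * 4).astype(np.float32)    # 4-second utterance at 16 kHz
b = np.random.randn(16000 * 6).astype(np.float32)    # 6-second utterance at 16 kHz
x = mix_two(a, b, overlap_ratio=0.4)                 # 40% of the shorter utterance overlaps
```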
In Section 4.3, the evaluation results from the three types of speaker embeddings, i-vector, d-vector, and x-vector, are shown. The intermediate vector (i-vector) extractor [10] was constructed using Gaussian mixture models (GMM) based on the universal background model (UBM). We extracted 400-dimensional speaker embeddings using the Kaldi toolkit [38,39]. The pretrained i-vector extractor was obtained from the Kaldi official website [38]. The x-vector extractor [12], based on a deep neural network, was utilized to extract 512-dimensional speaker embedding vectors from the input utterances. We employed a pretrained x-vector extractor from the Kaldi official website. Both of these models were trained on the augmented VoxCeleb 1 and 2 datasets [40]. The input features for both extractors were 30-dimensional mel-frequency cepstral coefficients (MFCCs), with a frame length of 25 milliseconds and a shift size of 10 milliseconds. The x-vectors were extracted only from speech frames, which were selected by an energy-based voice activity detector. L2-normalized speaker embeddings were then used as an additional input to the ASR model. The d-vector extractor [11] was based on a recurrent neural network implemented in PyTorch [41]. We extracted 256-dimensional speaker embedding vectors. The input features for the extractor were 40-dimensional mel-scale log spectrograms, with a frame length of 25 milliseconds and a frame shift size of 10 milliseconds. The extractor was configured with a three-layer LSTM, where each layer had a hidden dimension of 256.
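The front-end configuration above can be reproduced approximately with torchaudio, as in the sketch below. The FFT size, mel-filter count, and the dummy waveform and embedding are assumptions made for illustration; the actual i-/x-vector and d-vector extractors are not reproduced here.

```python
import torch
import torchaudio

# 30-dimensional MFCCs with a 25 ms window and 10 ms shift at 16 kHz
sr = 16000
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sr, n_mfcc=30,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40})

waveform = torch.randn(1, sr * 3)                 # a 3-second dummy waveform
feats = mfcc(waveform)                            # shape: (1, 30, frames)

embedding = torch.randn(512)                      # e.g., an x-vector from a pretrained extractor
embedding = embedding / embedding.norm()          # L2 normalization before use
```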

4.2. ASR Model Training

All of the proposed multi-speaker speech recognition models were implemented in the ESPnet framework [42] with the PyTorch [41] backend. The acoustic features consisted of 80-dimensional mel-scale log filterbank energies, including pitch features, deltas, and delta-deltas. The input features were created by concatenating the speaker embedding vector with the acoustic features. Consequently, the dimension of the input feature depended on the type of speaker embedding vector adopted. For instance, in the case of i-vectors, the input dimension became 480 (400 for the speaker embedding and 80 for the acoustic features). In the case of x-vectors, the dimension was 592 (512 + 80), and it was 336 (256 + 80) for d-vectors. In the conventional Transformer-based multi-speaker ASR model, there are 18 layers in total [43,44].
For the proposed methods, Enc_mix acts as the CNN (convolutional neural network) embedding layer. Enc_diff and Enc_rec consist of four and eight Transformer layers, respectively. For all Transformers, the attention size was set to 256, the number of attention heads to 4, and the fully connected feed-forward network size to 2048. The Transformer model was trained with the Adam optimizer and Noam learning rate decay, as outlined in [3]. The backend ASR decoder, Dec_rec, was initialized with a pretrained single-speaker speech recognition model from the ESPnet LibriSpeech recipe. During training, it was kept frozen for the first 15 epochs to ensure stability. After all ASR model training, a Transformer language model from the ESPnet LibriSpeech recipe, pretrained on the single-speaker LibriSpeech data, was employed without fine tuning. Similarly, as introduced in Section 4.1, all speaker embedding extractors were also used without additional training, on top of the models pretrained on the VoxCeleb 1 and 2 speaker recognition datasets [40].
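The Adam-plus-Noam schedule mentioned above can be written compactly as a PyTorch LambdaLR, as in the following sketch; the warmup step count, the model dimension of 256, and the placeholder model are illustrative assumptions rather than the exact training configuration.

```python
import torch

def noam_lr(step, d_model=256, warmup_steps=25000):
    """Noam decay: linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(256, 256)                   # placeholder for the trainable encoders
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(10):                              # one scheduler step per optimizer step
    optimizer.step()
    scheduler.step()
```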
To train the baseline ASR model and the proposed models, we used four NVIDIA RTX TITAN GPUs, each with 24 GB of memory. The single-speaker ASR model based on the Transformer architecture has 113 million parameters, and its training data consist of 460 h from the train-clean-100 and train-clean-360 subsets of LibriSpeech [36]. Training required 223 h, which is equivalent to 892 GPU hours because four GPUs were used. The baseline multi-speaker ASR [43,44] has 90.6 million parameters, and 30.6 million of them were taken from the single-speaker ASR Transformer model. The remaining 60 million parameters were trained on multi-speaker utterances. The multi-speaker training set consisted of 270 h, as shown in Table 1, and it took 320 GPU hours to train the multi-speaker modules. For the proposed SAT methods, 0, 11.2, and 4.2 million additional parameters were required for the i-, d-, and x-vectors, respectively; they took 320, 336, and 328 GPU hours to train. The number of trainable parameters and the total training hours are compared in detail in Table 2.

4.3. Evaluation Results and Discussions

To determine the best-performing speaker embedding and profile settings for the proposed SAT-ASR model, we evaluated various profile and speaker embedding configurations. The first set of experiments examined how performance changes with the number of utterances used per profile. Figure 5 shows the word error rate (WER) variations according to the number of profile utterances. In this experiment, the SAT-ASR model using x-vector speaker profiles was used for evaluation. The x-axis is the number of utterances per profile. When the number of utterances (single-speaker samples) for each profile was 1, the performance was the worst, with a WER of 8.61%. When the number of utterances per profile was set to 5, 10, and 20, the performance improved to 7.83%, 7.57%, and 7.48%, respectively. Performance improved as the number of utterances per profile increased, but it saturated beyond 20. Based on these results, 20 utterances per profile were used in the speech recognition performance evaluation.
Table 2 compares the proposed SAT-ASR models with conventional models. The number of trainable parameters, training time, and test WER are listed. For the single-speaker automatic speech recognition (SS-ASR) model, we adopted a simple Transformer-based architecture [44] because the proposed SAT-ASR methods require pretrained decoders from the SS-ASR. Its implementation recipe was retrieved from the ESPnet repository at https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1, accessed on 20 July 2024, where the reported WER on the test-clean set of LibriSpeech for single-speaker input was 2.5%. The SS-ASR model has 113 million (denoted by ‘M’ in Table 2) trainable parameters, and its training data consist of 460 h of the train-clean subsets of LibriSpeech [36]. It took 892 GPU hours to train all the parameters from scratch. As shown in Table 1, the test data were 3000 utterances of the test-clean subset of Libri2Mix generated by LibriMix. The evaluated WER was 76.9%, which is significantly worse than on the LibriSpeech test dataset. The huge degradation was due to the mismatch between the single-speaker ASR model and the multi-speaker input recordings. The second model uses a DPRNN-TasNet (dual-path recurrent neural network-based time-domain audio separation network) [45] to separate the speakers in the mixed-speech input. After separation, the word sequence of each output was generated using the SS-ASR, and the WER of each channel was measured. Because the pretrained SS-ASR was used, only the DPRNN-TasNet had to be trained, so the number of trainable parameters was only 2.6 million. It took 212 GPU hours to train the DPRNN-TasNet model, measured on the same NVIDIA GTX Titan GPU. The WER on the test-clean set of Libri2Mix was 13.6%, which is a significant improvement over the SS-ASR.
Table 2. Performance comparison of the baseline model and the proposed models in terms of the WER and their relative improvements (shown in parentheses) from the baseline. For compact notation, the following abbreviations are used: SS-ASR (single-speaker automatic speech recognition); MS-ASR (multi-speaker automatic speech recognition); SAT (speaker-attributed training); and DPRNN-TasNet (dual-path recurrent neural network-based time-domain audio separation network). The second and third columns show the number of parameters to be trained (where ‘M’ stands for million) and the time required by a single NVIDIA GTX Titan GPU to train those parameters, respectively. The fourth column shows the word error rates on the test set.
Model | Trainable Parameters | Training Time (GPU Hours) | Test WER (%)
SS-ASR [44] | 113 M | 892 | 76.90
DPRNN-TasNet + SS-ASR [45] | 2.6 M | 212 | 13.60
MS-ASR [6] | 60 M | 320 | 8.43 (-)
MS-ASR + SAT (proposed), i-vector | 60 M | 320 | 8.50 (−0.8)
MS-ASR + SAT (proposed), d-vector | 71.2 M | 336 | 8.32 (1.3)
MS-ASR + SAT (proposed), x-vector | 64.2 M | 328 | 8.10 (3.9)
MS-ASR + SAT + speaker profile (proposed), i-vector | 60 M | 320 | 8.27 (1.9)
MS-ASR + SAT + speaker profile (proposed), d-vector | 71.2 M | 336 | 8.18 (3.0)
MS-ASR + SAT + speaker profile (proposed), x-vector | 64.2 M | 328 | 7.48 (11.2)
The third row is the Transformer-based multi-speaker automatic speech recognition (MS-ASR) method [6]. This model corresponds to Figure 2-(d), which uses a split encoder for the multi-speaker speech inputs. It requires training a split encoder (Enc_mix in Figure 3), a speaker-differentiating encoder (Enc_diff), and a recognition encoder (Enc_rec). The total number of trainable parameters was about 60 million, and they were trained with the train-clean subset of Libri2Mix, whose size is 64,700 utterances and 270 h of mixed speech, as shown in Table 1. The training time was 320 GPU hours, excluding the time for the pretrained Enc_speaker, Dec_speaker, and Dec_rec in Figure 3. The evaluated WER on the same test set was 8.43%, which represents a significant improvement over the other conventional methods.
The next three rows are the results of the proposed method described in Section 3.1, which corresponds to Figure 3. Three different types of speaker embedding vectors were concatenated with the input features of the mixed speech to emphasize the information of the speakers. Then, the proposed speaker-attributed training (SAT) technique in Algorithm 3 was applied to the baseline MS-ASR. The number of trainable parameters for the i-vectors was the same as that of the baseline MS-ASR because there were no additional layers. In the case of the d- and x-vectors, 11.2 and 4.2 million trainable parameters were added, respectively. The training times with the same train-clean set of Libri2Mix were 320, 336, and 328 GPU hours, respectively. The increase in training time was less than 5% compared to the baseline. However, these methods also require the decoder of the pretrained SS-ASR model, so one should also consider the overhead of training the SS-ASR if it is not available. The WER of the i-vectors on the test set was 8.5%, which is worse than the baseline’s WER of 8.43%. The d-vectors’ WER was 8.32%, a relative improvement of 1.3% over the baseline. For the x-vectors, a WER of 8.1% and a 3.9% relative improvement were obtained, the best results among the methods that use speaker embedding vectors directly in speaker-attributed training.
The bottom three rows present the results of the method described in Section 3.2, which corresponds to Figure 4. The same speaker embedding vectors were used, but the extracted vectors were replaced by linear combinations of the profile embedding vectors built from the 20 selected utterances. The differences from the method in Section 3.1 are the attention weighting in Algorithm 4 and the forward inference in Algorithm 5. The storage size and computation time of the profile matrix were almost negligible, so the number of trainable parameters and the training times were almost the same as for SAT without profile weighting. The evaluated WERs on the test set were 8.27%, 8.18%, and 7.48% for the i-, d-, and x-vectors, respectively. All of the WERs were improved with the addition of speaker profiles. In particular, the x-vectors achieved an 11.2% relative improvement over the baseline, which was the best among all the cases.
Table 3 shows an example output generated by the proposed SAT-ASR with a profile. The input was the mixed speech of one speaker saying Label-A and the other saying Label-B, and Result-1 and Result-2 are the output word sequences. The word sequences were generated using the token-level text vocabulary of a Transformer-based language model [43], so there are several non-language tokens such as <eos>, meaning end of sentence. When comparing Label-A and Result-1, only a single error was observed: OF → BUT. When comparing Label-B and Result-2, again only a single error was observed: TEN → AND. In summary, the experimental results show that the proposed speaker-attributed training is highly effective in multi-speaker speech recognition tasks without adding significant storage or computational overheads.
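Since the word error rate is the main metric throughout Section 4, a minimal reference implementation via word-level edit distance is sketched below; it assumes whitespace tokenization and that non-language tokens such as <eos> have already been stripped, and it is not the ESPnet scoring script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("SHE GOES TO SCHOOL", "SHE GOES TO SCHOOL"))   # 0.0
print(wer("TEN MEN WENT", "AND MEN WENT"))               # one substitution -> 0.333...
```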

5. Conclusions

This paper presents novel solutions to improve speech recognition performance in multi-speaker speech environments. In such settings, where multiple speakers talk simultaneously, traditional speech recognition systems struggle to distinguish between the different speakers. For scenarios with overlapping speech, we introduced the speaker-attributed training automatic speech recognition (SAT-ASR) model, featuring a novel structure and incorporating additional inputs such as speaker embeddings and a speaker profile method. We proposed using i-, d-, and x-vectors as effective cues to split the information of the multiple speakers mixed in the input features. We also proposed utilizing a profile matrix constructed from the embedding vectors of the enrolled speakers. Because the speaker embedding vectors are extracted from mixed speech, they are regularized by being replaced with linearly weighted combinations of the enrolled embedding vectors. The effectiveness of the proposed method was verified by multi-speaker speech recognition experiments on the LibriMix dataset. A comparative analysis was carried out among speech separation implemented by DPRNN-TasNet feeding a single-speaker ASR model, conventional multi-speaker ASR models implemented by Transformer networks, and the proposed SAT-ASR models. The word error rates (WERs) measured on a designated test set of LibriMix were 13.6%, 8.43%, and 8.1%, respectively. When employing speaker profiles, the SAT-ASR model further improved its WER to 7.48%, surpassing all the other multi-speaker ASR solutions. The computational efficiencies were also compared in terms of the number of trainable parameters and training times. The proposed methods required 336 and 328 GPU hours with and without speaker profiles, while the baseline multi-speaker model required 320 GPU hours, resulting in up to 5% additional training time.
One of the major advantages of the proposed methods is that the pretrained components of the single-speaker ASR model can be reused without fine tuning on multi-speaker utterances. These reused parts include the speech recognition decoder that yields the text outputs. Training a single-speaker ASR model from scratch required 892 GPU hours, but it did not need to be updated again. The proposed methods not only improved the accuracy of the speech recognition systems, but they were also adaptable to a variety of complex speech scenarios with a small amount of computational overhead.
Future research will focus on expanding the SAT-ASR model to be applicable to scenarios with an arbitrary number of speakers. To achieve this, we will explore techniques for automatically detecting the number of speakers in an audio input and adjusting the encoder of the proposed model accordingly. Additionally, we aim to improve the model’s structure to ensure that, when recognizing a single-speaker speech in multi-speaker environments, the computational load is comparable to that of conventional single-speaker ASR models. Through these plans, we intend to achieve more accurate multi-speaker speech recognition in complex real-world environments.

Author Contributions

Conceptualization, M.K. and G.-J.J.; methodology, M.K.; software, M.K.; validation, M.K. and G.-J.J.; formal analysis, M.K. and G.-J.J.; investigation, G.-J.J.; resources, G.-J.J.; data curation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, M.K. and G.-J.J.; visualization, M.K.; supervision, G.-J.J.; project administration, G.-J.J.; funding acquisition, G.-J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (no. RS-2021-II212068, Artificial Intelligence Innovation Hub; and R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://github.com/JorisCos/LibriMix (https://doi.org/10.48550/arXiv.2005.11262, accessed on 20 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR	Automatic Speech Recognition
WER	Word Error Rate
DPCL	Deep Clustering
PIT	Permutation Invariant Training
SAT	Speaker-Attributed Training
SAT-ASR	Speaker-Attributed Training for Automatic Speech Recognition
UBM	Universal Background Model
GMM	Gaussian Mixture Model
PLDA	Probabilistic Linear Discriminant Analysis
DNN	Deep Neural Network
LSTM	Long Short-Term Memory
SISO	Single Input Single Output
SIMO	Single Input Multiple Outputs
MFCC	Mel-Frequency Cepstral Coefficient
CNN	Convolutional Neural Network
SS-ASR	Single-Speaker Automatic Speech Recognition
MS-ASR	Multi-Speaker Automatic Speech Recognition
DPRNN	Dual-Path Recurrent Neural Network
TasNet	Time-Domain Audio Separation Network

References

  1. Cherry, E.C. Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 1953, 25, 975–979. [Google Scholar] [CrossRef]
  2. Hershey, J.R.; Chen, Z.; Roux, J.L.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 31–35. [Google Scholar] [CrossRef]
  3. Yu, D.; Kolbæk, M.; Tan, Z.H.; Jensen, J. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 241–245. [Google Scholar] [CrossRef]
  4. Chang, X.; Qian, Y.; Yu, D. Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks. In Proceedings of the Interspeech, ISCA, Hyderabad, India, 2–6 September 2018; pp. 1586–1590. [Google Scholar] [CrossRef]
  5. Chang, X.; Qian, Y.; Yu, K.; Watanabe, S. End-to-end monaural multi-speaker ASR system without pretraining. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6256–6260. [Google Scholar] [CrossRef]
  6. Chang, X.; Zhang, W.; Qian, Y.; Roux, J.L.; Watanabe, S. End-to-end multi-speaker speech recognition with transformer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6134–6138. [Google Scholar] [CrossRef]
  7. Denisov, P.; Vu, N.T. End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning. In Proceedings of the Interspeech, ISCA, Graz, Austria, 15–19 September 2019; pp. 4425–4429. [Google Scholar] [CrossRef]
  8. Kanda, N.; Ye, G.; Gaur, Y.; Wang, X.; Meng, Z.; Chen, Z.; Yoshioka, T. End-to-end speaker-attributed ASR with transformer. In Proceedings of the Interspeech, ISCA, Brno, Czechia, 30 August–3 September 2021; pp. 4413–4417. [Google Scholar] [CrossRef]
  9. Kanda, N.; Wu, J.; Wu, Y.; Xiao, X.; Meng, Z.; Wang, X.; Gaur, Y.; Chen, Z.; Li, J.; Yoshioka, T. Streaming speaker-attributed ASR with token-level speaker embeddings. In Proceedings of the Interspeech, ISCA, Incheon, Korea, 18–22 September 2022; pp. 521–525. [Google Scholar] [CrossRef]
  10. Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouche, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  11. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar] [CrossRef]
  12. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. x-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
  13. Cosentino, J.; Pariente, M.; Cornell, S.; Deleforge, A.; Vincent, E. LibriMix: An open-source dataset for generalizable speech separation. arXiv 2020, arXiv:2005.11262. [Google Scholar] [CrossRef]
  14. Zhu, Y.; Ko, T.; Snyder, D.; Mak, B.; Povey, D. Self-attentive speaker embeddings for text-independent speaker verification. In Proceedings of the Interspeech, ISCA, Hyderabad, India, 2–6 September 2018; pp. 3573–3577. [Google Scholar] [CrossRef]
  15. Park, T.J.; Kanda, N.; Dimitriadis, D.; Han, K.J.; Watanabe, S.; Narayanan, S. A review of speaker diarization: Recent advances with deep learning. Comput. Speech Lang. 2022, 72. [Google Scholar] [CrossRef]
  16. Ning, Y.; He, S.; Wu, Z.; Xing, C.; Zhang, L.J. A review of deep learning based speech synthesis. Appl. Sci. 2019, 9, 4050. [Google Scholar] [CrossRef]
  17. Wang, D.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef] [PubMed]
  18. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
  19. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. In Proceedings of the Interspeech, ISCA, Graz, Austria, 15–19 September 2019; pp. 3465–3469. [Google Scholar] [CrossRef]
  20. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  21. Garcia-Romero, D.; Espy-Wilson, C.Y. Analysis of i-vector length normalization in speaker recognition systems. In Proceedings of the Interspeech, ISCA, Florence, Italy, 27–31 August 2011; pp. 249–252. [Google Scholar] [CrossRef]
  22. Ioffe, S. Probabilistic linear discriminant analysis. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; Volume 9, pp. 531–542. [Google Scholar] [CrossRef]
  23. Kolbæk, M.; Yu, D.; Tan, Z.H.; Jensen, J. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1901–1913. [Google Scholar] [CrossRef]
  24. Gu, Z.; Liao, L.; Chen, K.; Lu, J. Target speech extraction based on blind source separation and x-vector-based speaker selection trained with data augmentation. arXiv 2020, arXiv:2005.07976. [Google Scholar]
  25. Chen, Z.; Kanda, N.; Wu, J.; Wu, Y.; Wang, X.; Yoshioka, T.; Li, J.; Sivasankaran, S.; Eskimez, S.E. Speech separation with large-scale self-supervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
  26. Meng, L.; Kang, J.; Cui, M.; Wang, Y.; Wu, X.; Meng, H. A sidecar separator can convert a single-talker speech recognition system to a multi-talker one. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
  27. Weng, C.; Yu, D.; Seltzer, M.L.; Droppo, J. Deep neural networks for single-channel multi-talker speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1670–1679. [Google Scholar] [CrossRef]
  28. Yu, D.; Chang, X.; Qian, Y. Recognizing multi-talker speech with permutation invariant training. In Proceedings of the Interspeech, ISCA, Stockholm, Sweden, 20–24 August 2017; pp. 2456–2460. [Google Scholar] [CrossRef]
  29. Tan, T.; Qian, Y.; Yu, D. Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5714–5718. [Google Scholar] [CrossRef]
  30. Huang, Z.; Raj, D.; García, P.; Khudanpur, S. Adapting self-supervised models to multi-talker speech recognition using speaker embeddings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
  31. Qian, Y.; Chang, X.; Yu, D. Single-channel multi-talker speech recognition with permutation invariant training. Speech Commun. 2018, 104, 1–11. [Google Scholar] [CrossRef]
  32. Kanda, N.; Chang, X.; Gaur, Y.; Wang, X.; Meng, Z.; Chen, Z.; Yoshioka, T. Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 809–816. [Google Scholar] [CrossRef]
  33. Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4080–4084. [Google Scholar] [CrossRef]
  34. Wang, Q.; Muckenhirn, H.; Wilson, K.; Sridhar, P.; Wu, Z.; Hershey, J.; Saurous, R.A.; Weiss, R.J.; Jia, Y.; Moreno, I.L. VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking. In Proceedings of the Interspeech, ISCA, Graz, Austria, 15–19 September 2019; pp. 2728–2732. [Google Scholar] [CrossRef]
  35. Cui, C.; Sheikh, I.; Sadeghi, M.; Vincent, E. End-to-end multichannel speaker-attributed ASR: Speaker guided decoder and input feature analysis. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Taipei, Taiwan, 16–20 December 2023. [Google Scholar] [CrossRef]
  36. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  37. Wichern, G.; Antognini, J.; Flynn, M.; Zhu, L.R.; McQuinn, E.; Crow, D.; Manilow, E.; Roux, J.L. WHAM!: Extending speech separation to noisy environments. In Proceedings of the Interspeech, ISCA, Graz, Austria, 15–19 September 2019; pp. 1368–1372. [Google Scholar] [CrossRef]
38. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, K.N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Waikoloa, HI, USA, 11–15 December 2011; pp. 1–4. [Google Scholar]
  39. Ravanelli, M.; Parcollet, T.; Bengio, Y. The pytorch-Kaldi speech recognition toolkit. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6465–6469. [Google Scholar] [CrossRef]
  40. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of the Interspeech, ISCA, Stockholm, Sweden, 20–24 August 2017; pp. 2616–2620. [Google Scholar] [CrossRef]
  41. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  42. Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Soplin, N.E.Y.; Heymann, J.; Wiesner, M.; Chen, N.; et al. ESPnet: End-to-end speech processing toolkit. In Proceedings of the Interspeech, ISCA, Hyderabad, India, 2–6 September 2018; pp. 2207–2211. [Google Scholar] [CrossRef]
  43. Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A Comparative Study on Transformer vs RNN in Speech Applications. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Singapore, 14–18 December 2019. [Google Scholar] [CrossRef]
  44. Karita, S.; Soplin, N.E.Y.; Watanabe, S.; Delcroix, M.; Ogawa, A.; Nakatani, T. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proceedings of the Interspeech, ISCA, Graz, Austria, 15–19 September 2019; pp. 1408–1412. [Google Scholar] [CrossRef]
  45. Li, S.; Ouyang, B.; Tong, F.; Liao, D.; Li, L.; Hong, Q. Real-time end-to-end monaural multi-speaker speech recognition. In Proceedings of the Interspeech, ISCA, Brno, Czechia, 30 August–3 September 2021; pp. 3750–3754. [Google Scholar] [CrossRef]
Figure 1. Multi-speaker speech recognition problem illustration. The voices of two independent speakers are recorded by a single microphone, denoted by x. A multi-speaker speech recognition system generates two or more word sequences, denoted by y(1) and y(2), from the given recording of overlapping speakers, where the parenthesized superscripts are speaker indices.
Figure 2. Four types of conventional multi-speaker automatic speech recognition methods. (a) A combination of acoustic source separation and single-input mixed speech, single-output text (SISO) ASR; (b) single-input mixed speech, multiple-output text (SIMO) ASR; (c) speaker embedding vectors supplied as an extra input to a SIMO ASR; and (d) an added encoder that splits multiple speakers into multiple representations, whose outputs are speaker and text embedding vectors suited to SISO ASR decoders.
Figure 3. Overview of the proposed speaker-attributed training ASR system. The gray blocks were trained on single-speaker recordings and kept fixed while the white blocks were trained with multi-speaker recordings. The same Enc rec and Dec rec are used with different inputs, as represented by the dotted link marked *shared. The boxed parts (**) require fine-tuning with multi-speaker utterances.
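The training pattern described in the Figure 3 caption (pretrained gray blocks kept fixed, white blocks trained on mixtures, and the shared Enc rec/Dec rec pair reused per speaker stream) can be illustrated with a minimal PyTorch-style sketch. The module names enc_rec, dec_rec, and enc_mix and the exact data flow are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of the Figure 3 training pattern: pretrained single-speaker
# blocks (gray) are frozen while the mixture-related blocks (white) are trained, and
# the shared Enc_rec/Dec_rec pair is reused for every speaker stream.
# Module names and data flow are illustrative assumptions only.
import torch.nn as nn

class SATASRSketch(nn.Module):
    def __init__(self, enc_rec: nn.Module, dec_rec: nn.Module, enc_mix: nn.Module):
        super().__init__()
        self.enc_rec = enc_rec  # pretrained single-speaker encoder (gray block)
        self.dec_rec = dec_rec  # pretrained single-speaker decoder (gray block, shared)
        self.enc_mix = enc_mix  # mixture encoder trained on multi-speaker data (white block)

    def freeze_pretrained(self):
        # Keep the gray blocks fixed; only the white blocks receive gradient updates.
        for p in list(self.enc_rec.parameters()) + list(self.dec_rec.parameters()):
            p.requires_grad = False

    def forward(self, mix_features, spk_embeddings):
        # Enc_mix yields one representation per speaker in the mixture; the shared
        # Enc_rec/Dec_rec pair is applied to each of them (the *shared link in Figure 3).
        per_speaker = self.enc_mix(mix_features, spk_embeddings)
        return [self.dec_rec(self.enc_rec(h)) for h in per_speaker]
```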
Figure 4. Overview of the SAT-ASR system when using speaker profiles. The speaker embedding vector q mix is passed through an additional block, Attention speaker, and then fed to Enc mix. P is a profile matrix composed of the speaker embedding vectors obtained from the training dataset, and β is the set of computed attention weights.
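To make the attention-weighted speaker embedding of Figure 4 concrete, the following is a minimal sketch assuming dot-product scoring between q mix and the rows of P followed by a softmax; the scoring function, dimensions, and any normalization used by the authors may differ.

```python
# Sketch of the Attention_speaker block in Figure 4, assuming dot-product scoring
# and a softmax over the profile entries; sizes below are placeholders.
import torch
import torch.nn.functional as F

def attention_weighted_speaker_embedding(q_mix: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """q_mix: (D,) speaker embedding extracted from the mixed input.
    P:     (N, D) profile matrix, one row per profile embedding from the training set.
    Returns a (D,) refined embedding as the beta-weighted sum of the profile rows.
    """
    scores = P @ q_mix                # (N,) similarity of q_mix to each profile entry
    beta = F.softmax(scores, dim=0)   # attention weights over the speaker profile
    return beta @ P                   # (D,) weighted combination passed on to Enc_mix

# Toy usage with placeholder sizes (8 profile entries, 512-dimensional embeddings).
P = torch.randn(8, 512)
q_mix = torch.randn(512)
refined = attention_weighted_speaker_embedding(q_mix=q_mix, P=P)
```

Here β plays the role of the attention weights in the caption, pulling the embedding extracted from the noisy mixture toward the closest profile entries.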
Figure 5. WER on the LibriMix dataset as a function of the number of profile utterances.
Table 1. Statistics of the evaluation datasets. The number of utterances, number of speakers, and total duration of recordings in hours are provided for each set. The Train set was used for training the models, the Dev set for hyperparameter tuning, and the Test set for the final evaluation. There is no overlap of samples or speakers across the sets. All samples are mixed recordings of two speakers from Libri2Mix.
Dataset   No. Utterances   No. Speakers   Hours
Train     64,700           1172           270
Dev       3000             40             11
Test      3000             40             11
Table 3. An example of the output word sequences generated by the proposed method. The input is a mixture of two speakers saying Label-A and Label-B with temporal overlap, and the proposed SAT-ASR system generates two sequences, Result-1 and Result-2. The word sequences are the token-level text generated by a Transformer-based language model [43].
          Token-Level Text
Label-A   SHE WAS YOUNG UN TRI ED NERVOUS IT WAS A VISION OF SERIOUS DUTIES AND LITTLE COMPANY OF REALLY GREAT LO NE LINESS
Result-1  SHE WAS YOUNG UN TRI ED NERVOUS IT WAS A VISION OF SERIOUS DUTIES AND LITTLE COMPANY BUT REALLY GREAT LO NE LINESS <eos>
Label-B   HE HOPED THERE WOULD BE STE W FOR DINNER TURN IP S AND CARR OT S AND B RU IS ED POT A TO ES AND FAT MU T TON PIECES TO BE LAD LED OUT IN THICK PEPPER ED FLOUR FAT TEN ED SAUCE
Result-2  HE HOPED THERE WOULD BE STE W FOR DINNER TURN IP S AND CARR OT S AND B RU IS ED POT A TO ES AND FAT MU T TON PIECES TO BE LAD LED OUT IN THICK PEPPER ED FLOUR FAT AND SAUCE <eos>