1. Introduction
Speech extraction refers to the extraction of individual signals from mixed signals and was first proposed to address the cocktail party problem [1,2]. Interference in speech can decrease the quality of the information being communicated. In addition, interference can severely affect related tasks such as automatic speech recognition (ASR): current speech recognition technology can accurately recognize an individual speaker, but when two or more speakers are present, recognition accuracy is greatly reduced. Thus, speech extraction has become an important factor in obtaining speech with better quality and intelligibility.
Many studies have been carried out on the speech extraction problem. The initial attempts were based on traditional methods. For example, methods based on signal processing estimate the power spectrogram of the noise or an ideal Wiener filter, e.g., spectral subtraction [3] and Wiener filtering [4,5]. Additionally, another class of algorithms is based on decomposition:

X ≈ WH,

where X is the spectrogram of a signal, which is decomposed into the matrix product of a basis matrix W and an activation matrix H. Non-negative matrix factorization (NMF) [6,7] constrains W and H to be non-negative and can be used to obtain the basic spectral patterns of non-negative data.
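For illustration, the following is a minimal sketch of such a decomposition applied to a magnitude spectrogram; the toy signal, STFT settings, and rank (n_components) are arbitrary choices for demonstration and are not taken from [6,7].

```python
# Minimal sketch: NMF decomposition of a magnitude spectrogram, X ≈ W H.
# The signal, rank, and STFT settings are illustrative choices only.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)  # toy signal

_, _, Z = stft(signal, fs=fs, nperseg=512)   # complex spectrogram (F x T)
X = np.abs(Z)                                # non-negative magnitude spectrogram

nmf = NMF(n_components=8, init="random", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                     # basis matrix, F x K (spectral patterns)
H = nmf.components_                          # activation matrix, K x T

print("relative reconstruction error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```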
Another approach is computational auditory scene analysis (CASA) [8], which uses auditory grouping cues. The CASA method in [9] models auditory signals and utilizes the similarities between the fundamental frequencies of speech signals. A method based on Bayesian inference was proposed by Barniv [10]; it describes the formation of pure-tone sequences as "auditory streaming" and estimates the auditory sequences by processing prior probabilities with the Bayesian criterion. Other methods based on neural computation represent auditory streams as units of neurons, with competition between auditory streams realized by inhibitory connections between neurons. Wang [11] developed a method that utilizes local and global inhibitory mechanisms to separate auditory streams, and Mill [12] proposed a method based on temporal coherence, which uses a prediction mechanism to promote competition among different groups. The methods mentioned above extract speakers by establishing mathematical models, so their performance does not carry over to more complex situations. Additionally, the accuracy of the speech extracted with traditional methods is far below that of deep learning methods.
In recent years, deep learning methods have achieved good results in speech extraction under challenging conditions such as non-stationary interference, and many studies have applied deep learning to the task. However, most previous studies have relied on clean speech from the target speaker: they train networks on data from the target speaker, thus generating models that extract a particular target, and the models are trained on fixed speaker pairs or fixed target speakers. These models also rely on the assumption that the amount of data fed to the network is sufficient, so speakers without substantial data cannot be extracted. However, clean speech from a particular target speaker cannot always be obtained, and sometimes only a few utterances from the target speaker can be recorded in a conversation. We therefore sought a method that does not require substantial clean speech from the target speaker.
Some methods have already been proposed to solve this problem. The SpeakerBeam network [13] addresses the speech extraction problem by training speaker-independent models that are informed by additional speaker information rather than creating a particular model for each target speaker. However, this method relies on the additional speaker information, which can only be obtained from conversations without any speech overlap or from personal devices, and such data are often inadequate for training.
To remove the need for additional clean speech recordings, we explored new approaches to speech extraction and investigated electroglottogram (EGG) signals. These signals originate from the vibrations of the vocal folds in the human throat and can be recorded without interference from other noises. In addition, the EGG signals of a particular speaker can be recorded during conversations in any situation, which increases the amount of additional target speaker information available. By utilizing EGG signals from a target speaker, features can be extracted from the signals and applied to deep learning methods for speech extraction. Because EGG signals provide information about a particular speaker, the designated speech can be extracted from a conversation. Since EGG signals can be obtained in any situation during the process of speaking, they can also be used for real-time speech extraction.
This paper is organized as follows: In Section 2, we examine studies related to speech extraction and present our work on the topic. Section 3 introduces the materials and methods used in our study and illustrates our model in detail. In Section 4, we compare the results from our network to those from previous studies using different datasets and under different signal-to-noise ratios (SNRs). In Section 5, we discuss our work and findings. Finally, Section 6 presents our conclusion and introduces the future directions of our study.
2. Related Works
Speech extraction algorithms can be divided into single-channel and multi-channel speech extraction algorithms [14,15] according to the number of microphones used to record the speakers. Single-channel speech extraction is usually solved with time-domain or frequency-domain methods, while multi-channel speech extraction is solved using methods that extract coherent signals from different speakers. In our work, we focus mainly on single-channel speech extraction.
With the rapid development of machine learning and artificial intelligence, the performance of deep-learning-based audio signal processing algorithms has been further improved, and speech extraction technology has also advanced thanks to deep learning [16,17,18,19,20,21,22,23,24]. In most cases, the target speech is extracted in the frequency domain. In these methods, networks compute spectrograms of the mixed speech using short-time Fourier transforms (STFTs) and then generate spectrograms of the estimated target speech. Other methods operate in the time domain; they extract time-domain features of the target speech and avoid the phase mismatch problem by using adaptive front ends and direct regression instead of STFTs.
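As an illustration of the frequency-domain pipeline (STFT, mask, ISTFT), the following minimal sketch applies an oracle ratio mask computed from the reference signals; in an actual system, the mask would be predicted by a network from the mixture spectrogram, and the random placeholder signals and STFT settings here are illustrative only.

```python
# Sketch of the frequency-domain pipeline: STFT -> mask -> ISTFT.
# The mask here is an oracle ratio mask computed from the references,
# standing in for a network prediction; signals are random placeholders.
import numpy as np
from scipy.signal import stft, istft

fs, n = 16000, 16000
target = np.random.randn(n)          # placeholder target speech
interference = np.random.randn(n)    # placeholder interfering speech
mixture = target + interference

_, _, S_mix = stft(mixture, fs=fs, nperseg=512)
_, _, S_tgt = stft(target, fs=fs, nperseg=512)
_, _, S_int = stft(interference, fs=fs, nperseg=512)

# Ratio mask on magnitudes (oracle; a network would estimate this from S_mix)
mask = np.abs(S_tgt) / (np.abs(S_tgt) + np.abs(S_int) + 1e-8)

# Apply the mask to the mixture spectrogram, keeping the mixture phase
S_est = mask * S_mix
_, estimate = istft(S_est, fs=fs, nperseg=512)

print(estimate.shape)
```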
Deep clustering (DC) is a frequency-domain method that was proposed by Hershey [25] in 2016. In this method, the amplitude spectrogram features of the mixed speech, of dimension (T, F), are mapped into a higher-dimensional (T, F, D) embedded feature space, i.e., each time-frequency unit is mapped to a D-dimensional feature vector, which makes the mixed input features more distinguishable. The target of this method is to generate binary masks that assign a value of 1 to the time-frequency regions belonging to the target speech and 0 to the regions belonging to the other speech. By multiplying the binary masks by the spectrograms of the mixed speech, the network can mask out the interference regions of the mixed spectrograms and obtain the target speech from the mixture.
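A minimal sketch of this idea is shown below: random embeddings stand in for the output of the embedding network, the time-frequency units are clustered with k-means, and the resulting assignment is used as a binary mask; all shapes are illustrative.

```python
# Sketch of deep-clustering-style mask estimation.
# Random embeddings stand in for the output of an embedding network.
import numpy as np
from sklearn.cluster import KMeans

T, F, D = 100, 257, 20                     # frames, frequency bins, embedding dim
embeddings = np.random.randn(T * F, D)     # (T*F, D), normally produced by a DNN
mix_mag = np.abs(np.random.randn(T, F))    # placeholder mixture magnitude spectrogram

# Cluster the time-frequency units into two sources and build a binary mask
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
mask_source0 = (labels.reshape(T, F) == 0).astype(float)

# Masked spectrogram of one source (binary mask times mixture spectrogram)
est_mag = mask_source0 * mix_mag
print(est_mag.shape)
```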
Another technique is permutation invariant training (PIT). In 2017, Yu [26] proposed the PIT method and applied it to the speech extraction task. This method evaluates the mean square error (MSE) for every possible assignment between the outputs and the references and optimizes the smallest one, which, compared to DC, effectively solves the permutation problem between the target and the interference and finds the best match for the desired target. In 2021, Yousefi [27] combined the traditional PIT method with long short-term memory (LSTM) networks and improved the algorithm's efficiency. The algorithm uses a probabilistic optimization framework and addresses the low efficiency of PIT by finding the best output label assignment. This method significantly outperforms traditional speech extraction methods in terms of the signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR). In conclusion, the PIT algorithm provides a good training criterion for speaker-independent speech extraction by dealing with the permutation and combination problem.
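The core of the PIT criterion can be sketched as follows (PyTorch, with random placeholder tensors and a hypothetical helper pit_mse_loss): the MSE is evaluated for every permutation of the outputs and references, and the smallest value is used as the training loss.

```python
# Sketch of a permutation invariant training (PIT) loss for two sources.
# Tensors are random placeholders for network outputs and references.
from itertools import permutations
import torch

def pit_mse_loss(estimates, references):
    """estimates, references: tensors of shape (batch, n_sources, ...)."""
    n_src = estimates.shape[1]
    per_perm = []
    for perm in permutations(range(n_src)):
        # MSE between each permuted estimate and its paired reference,
        # averaged over all non-batch dimensions
        err = torch.mean((estimates[:, list(perm)] - references) ** 2,
                         dim=tuple(range(1, estimates.dim())))
        per_perm.append(err)
    # For every utterance, keep the permutation with the smallest error
    return torch.stack(per_perm).min(dim=0).values.mean()

est = torch.randn(4, 2, 16000)   # batch of 4 mixtures, 2 estimated sources each
ref = torch.randn(4, 2, 16000)   # corresponding reference sources
print(pit_mse_loss(est, ref))
```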
In summary, the frequency-domain methods depend on the consistency of the outputs. These methods have multiple outputs, which makes it difficult to define the target outputs of the network. Moreover, applying inverse short-time Fourier transforms (ISTFTs) to the enhanced amplitude spectrogram combined with the phase spectrogram of the original mixture has a certain impact on speech extraction performance.
To avoid these problems with the frequency-domain methods, time-domain methods are used. Conv-TasNet is one of the most common. In 2019, Luo [28] proposed the convolutional time-domain audio separation network (Conv-TasNet), which is superior to several time-frequency amplitude masks for dual-speaker speech extraction. This method builds learnable front ends instead of STFTs, thereby generating features that are similar to those of a spectrogram.
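The learnable front-end idea can be sketched as follows; the filter count, kernel size, and stride are illustrative and do not reproduce the published Conv-TasNet configuration, and the separator network that would predict the mask is omitted.

```python
# Sketch of a learnable encoder/decoder front end in the spirit of Conv-TasNet.
# Hyperparameters are illustrative, not the published configuration.
import torch
import torch.nn as nn

class LearnableFrontEnd(nn.Module):
    def __init__(self, n_filters=256, kernel_size=32, stride=16):
        super().__init__()
        # Encoder: 1-D convolution producing a spectrogram-like representation
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Decoder: transposed convolution mapping features back to the waveform
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, waveform, mask=None):
        feats = torch.relu(self.encoder(waveform))   # (batch, n_filters, frames)
        if mask is not None:                         # mask predicted by a separator network
            feats = feats * mask
        return self.decoder(feats)                   # estimated waveform

frontend = LearnableFrontEnd()
mix = torch.randn(1, 1, 16000)                       # 1 s of audio at 16 kHz
print(frontend(mix).shape)
```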
In 2021, Li [29] proposed the dual-path recurrent neural network (DPRNN), which breaks long audio clips into smaller chunks to optimize the use of recurrent neural networks (RNNs) in a deep model. The DPRNN significantly reduces the model size compared to the time-domain audio separation network (TasNet) and enhances speech extraction performance.
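The dual-path segmentation step can be sketched as follows, assuming the feature sequence is longer than one chunk; the chunk size, overlap, and tensor shapes are illustrative, and the intra-chunk and inter-chunk RNNs themselves are omitted.

```python
# Sketch of the dual-path segmentation used by DPRNN-style models.
# A long feature sequence is folded into 50%-overlapping chunks so that an
# intra-chunk RNN and an inter-chunk RNN can alternate over short sequences.
import torch
import torch.nn.functional as F

def segment(features, chunk_size=100):
    """features: (batch, channels, time) -> (batch, channels, chunk_size, n_chunks)."""
    hop = chunk_size // 2
    batch, channels, time = features.shape
    # Pad the end so the sequence folds evenly into overlapping chunks
    pad = (hop - (time - chunk_size) % hop) % hop
    features = F.pad(features, (0, pad))
    chunks = features.unfold(-1, chunk_size, hop)   # (batch, channels, n_chunks, chunk_size)
    return chunks.permute(0, 1, 3, 2)

x = torch.randn(2, 64, 999)          # batch of long feature sequences
print(segment(x).shape)              # (batch, channels, chunk_size, n_chunks)
```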
Although the studies mentioned above have obtained excellent results on speech extraction problems, the networks tend to be complicated and lack portability. Time-domain speech extraction methods can achieve good extraction results, but they operate on the waveform point by point, so the models tend to be complex and computationally expensive.
SpeakerBeam and VoiceFilter [30] are examples of target speech extraction models that can be used in both the frequency domain and the time domain by exploiting extra information about the target speaker. SpeakerBeam uses a sequence summary network to generate spectrograms containing the features of the target speaker's speech, while VoiceFilter concatenates spectrogram features with d-vector features, which are extracted from the last hidden layer of a deep neural network, to estimate the clean speech of the target speaker. In 2018, Žmolíková [31] optimized this method using ASR technology and utilized predicted hidden Markov model (HMM) state posteriors to improve the masks. In 2019, Žmolíková [32] also refined SpeakerBeam to train models using extra information about the target speaker instead of training a particular model for each target speaker. This network utilizes information about the target speaker from adaptation speech [33] for both single-channel and multi-channel speech extraction and achieves better performance than earlier networks. SpeakerBeam has also been modified with an attention mechanism [34] to extract features from the additional information, which effectively improves the performance of the multimodal SpeakerBeam network [35]. Additionally, Delcroix [36] proposed a time-domain implementation of SpeakerBeam with auxiliary speaker information added to the network. However, all of the speech extraction algorithms mentioned above are based on the premise that clean speech from the target speaker can be obtained, which is difficult to achieve in practical applications.
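The way such models inject auxiliary speaker information can be sketched as follows, in the spirit of VoiceFilter: a fixed speaker embedding (here a random placeholder for a d-vector) is repeated along the time axis and concatenated with the mixture spectrogram frames before mask estimation; the small mask estimator and all sizes are illustrative.

```python
# Sketch of speaker-conditioned mask estimation in the spirit of VoiceFilter.
# The d-vector is a random placeholder for an embedding extracted from the
# target speaker's enrollment data; sizes and the mask network are illustrative.
import torch
import torch.nn as nn

n_freq, emb_dim, frames = 257, 256, 100
mix_spec = torch.randn(1, frames, n_freq)        # mixture magnitude spectrogram (B, T, F)
d_vector = torch.randn(1, emb_dim)               # target speaker embedding

# Repeat the speaker embedding over time and concatenate it with each frame
conditioning = d_vector.unsqueeze(1).expand(-1, frames, -1)     # (B, T, emb_dim)
net_input = torch.cat([mix_spec, conditioning], dim=-1)         # (B, T, F + emb_dim)

# A small mask estimator stands in for the full separation network
mask_net = nn.Sequential(
    nn.Linear(n_freq + emb_dim, 512), nn.ReLU(),
    nn.Linear(512, n_freq), nn.Sigmoid(),
)
mask = mask_net(net_input)                        # (B, T, F) mask for the target speaker
print((mask * mix_spec).shape)
```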
Therefore, we focused on finding other signals that could provide information about the target speech and eventually identified EGGs [37,38,39,40], which were invented by Adrian Fourcin [41]. EGGs are electrical signals measured on the skin that capture vocal fold vibrations during laryngeal vocalization. The acquisition of EGG signals is not susceptible to other noise or vibrations; because EGG signals are collected directly from the speaker's throat, they can be obtained effectively even in extremely noisy environments. In 2020, Bous [42] utilized a deep neural network and EGGs to estimate glottal closure instants (GCIs) and optimized the method by using the analysis-synthesis settings of real speech signals, which improved the final GCI detection performance. In 2020, Cangi [43] proposed a measurement to test the reliability of EGGs, which shows the differences in vowel values between genders and individuals; this work illustrates the potential of EGGs for speech processing. An EGG feature extraction module based on LSTM units [44] was proposed to replace SpeakerBeam's feature extraction network; it works through voiced segment extraction, feature extraction, and F0 smoothing and achieves 91.2% accuracy in the classification of EGGs. In 2022, Chen [45] proposed a cross-modal emotion distillation model that uses fundamental frequencies from EGG signals to improve emotion recognition accuracy on emotional databases. Since previous studies have shown that EGGs can improve performance on other acoustic tasks, we applied EGGs to speech extraction to verify whether they can improve speech extraction performance.
In our work, considering that EGG signals are not susceptible to other noises, we proposed a network based on EGGs to extract target speakers from mixed speech. This method differs from previous speech extraction methods that require clean speech from the target speaker, as it only needs EGG signals collected while the speech is being recorded, which simplifies the speech extraction procedure. As can be seen from the waveforms of EGG signals and speech signals, their voiced segments and silent segments correspond closely. In addition, to utilize the time-domain features of the EGG signals, we proposed a method to process the mixed signals using the information provided by the EGG signals.
5. Discussion
In Section 4, we detailed the series of speech extraction experiments that we conducted to compare speech extraction performance and identify the best method. To explore the effects of extracting information from different signals, we compared the SpeakerBeam network with our EGG_Aux and Pre_EGG_Aux networks. The results showed that, on the CDESD, the SDRi increased by 1.12 dB and the SISDRi increased by 0.86 dB with the EGG_Aux network, while with the Pre_EGG_Aux network the SDRi increased by 1.15 dB and the SISDRi increased by 0.89 dB. From these results, we could infer that EGG signals provide more information than speech signals and that more information could be extracted from their time-domain features. In addition, we tested our networks on the EMO-DB. These results showed that the Pre_EGG_Aux network increased the SDRi by 1.41 dB and the SISDRi by 1.72 dB, meaning that our proposed method achieved better speech extraction performance across different languages.
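For reference, the SDRi and SISDRi report the improvement of the (scale-invariant) signal-to-distortion ratio of the estimate over that of the unprocessed mixture; the following minimal sketch follows the standard SI-SDR definition with random placeholder signals and is not the evaluation code used in our experiments.

```python
# Sketch of the SI-SDR metric and its improvement over the unprocessed mixture.
# Signals are random placeholders; the definition follows the standard formulation.
import numpy as np

def si_sdr(estimate, reference):
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the target component
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

mixture = np.random.randn(16000)
target = np.random.randn(16000)
estimate = target + 0.1 * np.random.randn(16000)   # placeholder network output

sisdr_improvement = si_sdr(estimate, target) - si_sdr(mixture, target)
print(f"SISDRi: {sisdr_improvement:.2f} dB")
```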
As for different speech extraction circumstances, we conducted experiments involving different genders. The EGG-based method achieved a better performance in most situations, especially when the target speaker and the interfering speaker were of the same gender. In the female-female situation, the network using EGGs achieved a 2 dB increase, while a 1 dB increase was achieved in the male-male situation. For same-gender speech extraction, the Pre_EGG_Aux network achieved a level similar to that of different-gender extraction. As shown above, when dealing with female speech extraction, the network using EGG signals achieved better results, which suggests that the EGG signals of female speakers were easier to recognize than those of male speakers.
To verify the performance of our model under different SNRs, we mixed samples from the datasets at different amplitude ratios and compared the results for SNRs ranging from −5 dB to 5 dB. As shown in the Results Section, the SISDRi was 8.70 dB when the SNR was set to −5 dB, while the SISDRi was 4.97 dB when the SNR was set to 5 dB. These results suggested that our model achieves better speech extraction performance in noisier environments.
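The mixing procedure can be sketched as follows: the interference is rescaled so that the mixture reaches a prescribed SNR; the helper mix_at_snr and the random placeholder signals are illustrative and may differ from the exact script used in our experiments.

```python
# Sketch: mix a target utterance with interference at a prescribed SNR (in dB)
# by rescaling the interference; signals here are random placeholders.
import numpy as np

def mix_at_snr(target, interference, snr_db):
    target_power = np.mean(target ** 2)
    interference_power = np.mean(interference ** 2)
    # Scale the interference so that 10*log10(target_power / scaled_power) == snr_db
    scale = np.sqrt(target_power / (interference_power * 10 ** (snr_db / 10)))
    return target + scale * interference

target = np.random.randn(16000)
interference = np.random.randn(16000)
for snr in (-5, 0, 5):
    mixture = mix_at_snr(target, interference, snr)
    print(snr, mixture.shape)
```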