Article

An Efficient Multistage Approach for Blind Source Separation of Noisy Convolutive Speech Mixture

1 Department of Electrical Engineering, University of Engineering and Technology, Peshawar 25120, Pakistan
2 Department of Electrical Engineering, National University of Technology (NUTECH), Islamabad 44000, Pakistan
3 Department of Electrical Engineering, Northern Border University, Arar 73222, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(13), 5968; https://doi.org/10.3390/app11135968
Submission received: 21 May 2021 / Revised: 22 June 2021 / Accepted: 24 June 2021 / Published: 27 June 2021
(This article belongs to the Special Issue Advance in Digital Signal Processing and Its Implementation)

Abstract:
This paper proposes a novel, efficient multistage algorithm to extract source speech signals from a noisy convolutive mixture. The proposed approach comprises two stages: Blind Source Separation (BSS) and de-noising. In the BSS stage, a hybrid source prior model separates the source signals from the noisy reverberant mixture; the high- and low-energy components are modeled by generalized multivariate Gaussian and super-Gaussian models, respectively. In the de-noising stage, Minimum Mean Square Error (MMSE) filtering reduces the noise in the noisy convolutive mixture signal. Furthermore, two system configurations are investigated to assess the performance gain. In the first model, the speech signals are separated from the observed noisy convolutive mixture in the BSS stage, followed by suppression of noise in the estimated source signals in the de-noising module. In the second model, the noise in the received noisy convolutive mixture is first reduced using MMSE filtering in the de-noising stage, followed by separation of source signals from the de-noised reverberant mixture in the BSS stage. We evaluate the performance of the proposed scheme in terms of signal-to-distortion ratio (SDR) and compare it with other well-known multistage BSS methods. The results show the superior performance of the proposed algorithm over the other state-of-the-art methods.

1. Introduction

In a noisy real-time environment, the performance efficiency of Blind Source Separation (BSS) applications is degraded by background noise and interfering signals. The classical methods used for speech enhancement have reached their saturation level in terms of enhancement and performance. The estimation of the desired source signal from a mixture with noise, especially for non-stationary noisy conditions, is a bottleneck for these techniques. Therefore, the BSS applications require a solution that can suppress the noise according to the nature of the environment.
The speech signal enhancement problem has been well-studied in recent decades. Different solutions have been proposed to enhance the intelligibility and quality of speech signals and improve the performance of BSS systems. Classical techniques overcome this problem by using adaptive techniques such as minimum mean square error (MMSE) [1,2,3,4,5,6,7]. The MMSE filter adjusts itself according to the observed convolutive mixture. Another solution uses statistical models that accurately diagonalize the second-order statistical properties of the noisy reverberant mixture. This approach uses an auto-correlation covariance matrix and its one-sample delayed matrix, forming two positive definite symmetric matrices. The diagonalization of these matrices can then be computed accurately via a generalized singular value decomposition (GSVD) using the tangent algorithm [8].
The BSS methods extract the desired source speech signal from the convolutive mixture in the presence of noise. Various BSS methods, such as Independent Component Analysis (ICA) and FASTICA, extract the source speech signals in a noisy reverberant environment. First, ICA de-noises the noisy reverberant mixture, and the FASTICA algorithm then separates the de-noised estimated speech signal from the observed convolutive mixture [9]. Additionally, in an underdetermined scenario, ICA has been combined with a speech recognition system (SRS) to extract the desired target speech signal [10]. However, these methods require prior knowledge of the mixing process and the number of source signals. The advantage of our proposed BSS method is that it separates the target source speech signal from the reverberant mixture without prior knowledge of either the mixing process or the number of source signals, giving our approach additional efficiency over existing speech processing methods that require prior knowledge or training. Unlike the existing models, we use multivariate generalized Gaussian and super-Gaussian source priors combined into a hybrid source prior model. In this hybrid model, the generalized Gaussian source prior exploits higher-order statistical properties, while the multivariate super-Gaussian model captures the remaining relevant information.
The main problem encountered by BSS techniques is the permutation and scaling ambiguities after the speech separation process. Therefore, in [11], the authors proposed a solution that can easily recognize the desired source speech signal in a noisy environment by looking at the speaker’s face. An audiovisual coherence is used to estimate the speech signals using statistical methods where statistical tools model the audio and visual information in the frequency domain (FD).
Furthermore, in scenarios with multiple audio sources and multiple microphones, the performance of the BSS separation process is improved by using the BSS output to generate Wiener filter coefficients and applying them to the desired speech signals [12]. Moreover, adaptive filtering combined with BSS can also reduce the noise, leading to speech enhancement and noise reduction. Forward Blind Source Separation (FBSS) combined with the Simplified Fast Transversal Filter (SFTF) method yields an adaptation gain from forward prediction [13]. Nevertheless, adaptive filtering methods face problems while canceling or suppressing acoustic noise. This issue is tackled by the Modified Predator–prey Particle Swarm Optimization (MPPPSO) approach, which also solves the steady-state error problem of PPPSO for non-stationary inputs and large filter lengths [14]. The acoustic noise can also be suppressed by introducing a variable step size in the two-channel sub-band forward algorithm (2CSF), which improves the convergence speed and overcomes the fixed-step-size limitation of the traditional 2CSF method [15]. Another variable-step-size approach is adaptive blind source separation through a two-channel forward–backward structure based on the normalized least-mean-square (NLMS) method, which uses variable step sizes for steady-state conditions [16]. Enhancement of the estimated source signal in the presence of acoustic noise is performed by Threshold Wavelet-based Forward Blind Source Separation (TWFBSS), which reduces the computational complexity compared with the Wavelet-based Forward Blind Source Separation (WFBSS) method [17].
Kalman filters can also be used with BSS techniques to deal with the noisy convolutive mixture. First, the BSS approach extracts the estimated source speech signal from the non-stationary noisy reverberant mixture. Then, Kalman filtering suppresses the noise components in the estimated speech signal [18]. Recently, new evolving techniques such as deep learning are also applied with the BSS approach in the reverberant noisy environment [19]. In general, the BSS methods are tested under non-Gaussian noise modeled by the fourth-order cumulant, and the singular value decomposition-total least square method [20]. Moreover, the speech signals are often corrupted by different types of noise produced in the surrounding environment that can be tackled by the Dual Recursive non-Quadratic (DRNQ) adaptive method combined with FBSS to enhance the speech quality [21].

1.1. Background

The BSS methods estimate the desired source speech signals from the observed convolutive mixture containing noise. Accurate identification of the target speech signal in a noisy reverberant environment is the fundamental goal of speech processing systems. The traditional BSS methods are limited to scenarios with multiple speech signals and sensors, where the de-noising process is challenging. Nevertheless, various signal processing methods, such as Single-channel Blind Source Separation (SBSS), Sparse Component Analysis (SCA), and Variational Mode Decomposition (VMD), can tackle this issue. The VMD method is applied to decompose a single channel into two channels, and SCA then separates the speech signals. This approach enhances the speech signal under underdetermined conditions [22].
Another approach, AdaGrade, is proposed in [23] for blind audio speech extraction using a gradient-based algorithm: the gradient learning rule is modified by pre-conditioning the input signal and applying the AdaGrade update. In a related method, the natural gradient method with two-step pre-processing suppresses the noise in the received reverberant mixture. First, a bias removal step followed by the least-squares method de-noises the noisy convolutive mixture; a joint algorithm with a gradient method then estimates the noisy signals and the mixing matrix [24]. Moreover, in [25], the BSS involves Eigen filtering, which retains the dominant frequencies of the signal, after which Wavelet de-noising is applied. This suppresses the noise components and preserves the speech signal regardless of its frequency content. The authors of [26] propose an alternative method based on temporal predictability to obtain the individual independent noise signal, where a non-negative matrix factorization algorithm enhances the speech signal. The performance is improved by adding time-correlation to the objective function, which restricts the time-varying gain of the noise [27]. Masking techniques can also separate the desired speech signal from the received mixture, where a time-frequency masking rule defines the BSS method [28]. In [29], the authors propose an EM algorithm to suppress the noise in the convolutive mixture for both a complex-Gaussian signal model and an unknown deterministic model. A statistical model is defined for each, and the EM algorithm is developed to estimate the speech signal and its acoustic parameters.
Recently, unsupervised speech enhancement algorithms are gaining interest that use a Real-Time (RT) two-channel BSS algorithm. In this method, a non-negative matrix factorization (NMF) dictionary is combined with a generalized cross-correlation (GCC) spatial localization approach. The RT-GCC-NMF operates in a frame-by-frame manner, comparing individual dictionary atoms with the desired speech signal or interfering noise based on the time-delay arrivals [30].

1.2. Contributions

The separation gain of a BSS approach depends on the selection of an appropriate source prior function for extracting the desired speech signals [31,32]. For example, [33] proposes a mixed source prior model comprising super-Gaussian and Student's T distributions to enhance the performance of the BSS. Subsequently, in [34], the performance is improved by using a hybrid model consisting of multivariate super-Gaussian and generalized Gaussian source priors. This approach models the higher amplitudes of the observed convolutive mixture by a multivariate generalized Gaussian source prior, while the low amplitudes are exploited by a multivariate super-Gaussian source prior. Unlike these existing works, we propose an efficient multistage BSS method. In this method, multivariate generalized Gaussian and super-Gaussian source priors are combined into a hybrid source prior model: the generalized Gaussian prior exploits higher-order statistical properties, while the multivariate super-Gaussian prior models the remaining relevant information. The contributions of this research work are as follows:
  • We propose a novel, efficient multistage approach for BSS applications that cascades a BSS stage with a de-noising stage. The proposed hybrid model combines multivariate generalized Gaussian and super-Gaussian source priors.
  • Based on the hybrid model, two different schemes are introduced: BSS followed by de-noising, and de-noising followed by BSS.
  • The performance of the proposed multistage hybrid model is compared against other multistage BSS methods that use single source priors.
  • The performance of the proposed models is investigated via extensive simulations in a noisy reverberant environment.

1.3. Organization

The article is organized into the following sections. Section 2 describes the hybrid source prior signal model for the Independent Vector Analysis (IVA). Section 3 provides a detailed description of the proposed multistage approach for speech enhancement, followed by the results and discussion in Section 4. In Section 5, we evaluate the performance of the proposed multistage model. Section 6 presents the conclusion and future works.

2. Signal Model

Consider a clean source speech signal $x(t)$, a noise signal $n(t)$, a mixing matrix $A$, and a received speech signal $y(t)$ contaminated by noise. The observation can be modeled as
$$ y(t) = A\,x(t) + n(t). \quad (1) $$
The clean source signal, the noise signal, and the received noisy signal are transformed to the frequency domain and denoted by $X(k)$, $N(k)$, and $Y(k)$, respectively, where $k$ denotes the position index of the coefficient in the transformed domain. The estimator for the observation is designed to minimize the mean square error (MSE)
$$ E\left\{ \left| X(k) - \hat{X}(k) \right|^2 \right\}, \quad (2) $$
where $E\{\cdot\}$ is the expectation operator and $\hat{X}(k)$ is the estimated source signal. A Minimum Mean Square Error (MMSE) filter can be used to minimize the MSE in (2).
Given the noisy observation $\{y(t);\ 0 \le t \le T\}$ with frequency-domain coefficients $Y(k)$, the estimate $\hat{X}(k)$ can be obtained by [35]
$$ \hat{X}(k) = E\{ X(k) \mid Y(k) \}. \quad (3) $$
Equation (3) can be rewritten using Bayes' theorem [35,36,37] as
$$ \hat{X}(k) = \frac{\int a_k\, p(Y(k) \mid a_k)\, p(a_k)\, da_k}{\int p(Y(k) \mid a_k)\, p(a_k)\, da_k}, \quad (4) $$
where $p(\cdot)$ is the probability density function (pdf) and $a_k$ is a dummy variable ranging over all possible values of $X(k)$. Assuming Gaussian distribution models, $p(Y(k) \mid a_k)$ and $p(a_k)$ can be written as
$$ p(Y(k) \mid a_k) = \frac{1}{\sqrt{2\pi \lambda_n(k)}} \exp\!\left( -\frac{(Y(k) - a_k)^2}{2\lambda_n(k)} \right) \quad (5) $$
and
$$ p(a_k) = \frac{1}{\sqrt{2\pi \lambda_x(k)}} \exp\!\left( -\frac{a_k^2}{2\lambda_x(k)} \right), \quad (6) $$
where $\lambda_n(k) = E\{|N(k)|^2\}$ and $\lambda_x(k) = E\{|X(k)|^2\}$ are the variances of the noise signal and the clean signal, respectively. Substituting (5) and (6) into (4), $\hat{X}(k)$ can be rewritten as [35,36]
$$ \hat{X}(k) = \frac{\xi(k)}{\xi(k) + 1}\, Y(k), \quad (7) $$
where $\xi(k) = \lambda_x(k) / \lambda_n(k)$ is the a priori SNR. The values of $\lambda_x$ and $\lambda_n$ must be known; [38,39] describe detailed methods for estimating $\lambda_x$. A decision-directed method was developed in [36] to estimate $\lambda_x$; its estimate $\hat{\lambda}_x$ is given by [35]
$$ \hat{\lambda}_x(k) = \alpha\, \hat{\lambda}_x^{p}(k) + (1 - \alpha)\, \max\!\left( |Y(k)|^2 - \lambda_n(k),\ 0 \right), \quad (8) $$
where $\max(\cdot)$ ensures non-negative values, $\hat{\lambda}_x^{p}(k)$ is the estimate of $\lambda_x$ from the previous frame, and $\alpha$ is a smoothing constant tuned for best results, set here to $0.98$. Setting $\alpha = 1$ deteriorates the speech signal, while smaller values result in high musical noise.
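The de-noising stage above can be sketched in a few lines of NumPy. This is a minimal illustration under the stated assumptions (known per-bin noise variance, frame-by-frame STFT processing, and the previous frame's clean estimate standing in for the previous-frame variance), not the authors' implementation; all function names are our own:

```python
import numpy as np

def mmse_gain(xi):
    """Wiener-type MMSE gain: G(k) = xi(k) / (xi(k) + 1)."""
    return xi / (xi + 1.0)

def decision_directed(Y, X_prev, lambda_n, alpha=0.98):
    """Decision-directed estimate of the clean-speech variance lambda_x.

    Y        : noisy STFT coefficients of the current frame
    X_prev   : estimated clean coefficients of the previous frame
    lambda_n : noise variance per frequency bin (assumed known)
    alpha    : smoothing constant (0.98, as in the text)
    """
    lambda_x_prev = np.abs(X_prev) ** 2                   # previous-frame estimate
    ml_term = np.maximum(np.abs(Y) ** 2 - lambda_n, 0.0)  # non-negative ML term
    return alpha * lambda_x_prev + (1.0 - alpha) * ml_term

def denoise_frame(Y, X_prev, lambda_n, alpha=0.98):
    """One de-noising step: estimate the a priori SNR, then apply the MMSE gain."""
    lam_x = decision_directed(Y, X_prev, lambda_n, alpha)
    xi = lam_x / np.maximum(lambda_n, 1e-12)  # a priori SNR
    return mmse_gain(xi) * Y
```

Because the gain lies strictly between 0 and 1, the filter always attenuates each bin; bins where the a priori SNR is high are passed almost unchanged, while low-SNR bins are strongly suppressed.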

3. Proposed Multistage BSS Approach

This section presents the proposed multistage approach for BSS and speech enhancement in a noisy reverberant environment. The multistage method comprises the BSS stage and de-noising stage using MMSE filtering as shown in Figure 1 and Figure 2, respectively. The proposed scheme evaluates different combinations of the BSS hybrid model and the de-noising MMSE method. In the first model (Figure 1), the observed convolutive mixture speech signal is first processed by the BSS stage with a hybrid source prior model for the extraction of estimated speech signals from the reverberant mixture. The de-noising module processes the resultant noisy extracted speech signals where the noisy elements in the separated speech signals are suppressed to improve the quality of the estimated signals. In the second model (Figure 2), the received reverberant observed speech mixture is de-noised by the MMSE filtering method in the first stage. In the second stage, the enhanced convolutive speech mixture is processed by the BSS stage with a hybrid source prior model to extract the de-noised estimated source speech signal from the enhanced reverberant mixture.
Multivariate generalized Gaussian and super-Gaussian source priors are combined into the hybrid source prior model in the BSS stage. The generalized Gaussian model exploits higher-order statistical properties, while the multivariate super-Gaussian model captures the remaining relevant information in the hybrid source prior approach. The weights of the source priors in the hybrid model are adapted according to the energy components of the received convolutive mixture [34]. In the de-noising stage, the MMSE filtering method is used to suppress the noisy component in the received convolutive mixture signal.
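The two stage orderings can be summarized by the following sketch, where `bss_separate` and `mmse_denoise` are hypothetical stand-ins for the hybrid-prior IVA stage and the MMSE filtering stage described above; passing them as parameters keeps the ordering logic independent of any particular implementation:

```python
def model_1(mixture, bss_separate, mmse_denoise):
    """Figure 1 ordering: separate first, then de-noise each estimated source."""
    estimates = bss_separate(mixture)            # noisy source estimates
    return [mmse_denoise(s) for s in estimates]  # suppress residual noise per source

def model_2(mixture, bss_separate, mmse_denoise):
    """Figure 2 ordering: de-noise each observed channel first, then separate."""
    enhanced = [mmse_denoise(ch) for ch in mixture]
    return bss_separate(enhanced)
```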
The hybrid source prior model provides better separation performance and preserves the frequency dependencies between different frequency blocks for the IVA algorithm. Instead of a single source prior distribution, a combination of multivariate generalized Gaussian and super-Gaussian models is used as the source prior for IVA to preserve these dependencies. The Kullback–Leibler (KL) divergence cost function preserves the dependencies within each source speech signal while removing the dependencies among different source signals [40]. Mathematically, the non-linear cost function for the hybrid model can be written as [31]
$$ C = \mathrm{KL}\!\left( p(\hat{s}_1, \ldots, \hat{s}_N) \,\Big\|\, \prod_{i=1}^{N} q(\hat{s}_i) \right) = \mathrm{const} - \sum_{k=1}^{K} \log \left| \det W(k) \right| - \sum_{i=1}^{N} E\left[ \log q(\hat{s}_i) \right], \quad (9) $$
where $W(k)$ is the $k$-th separating matrix and $q(\hat{s}_i)$ is the source prior of the $i$-th estimated source signal. The multivariate cost function in (9) is minimized by gradient descent to remove the dependencies among different source signals, which can be expressed as [31]
$$ \Delta w_{ij}(k) = -\frac{\partial C}{\partial w_{ij}(k)} = \sum_{l=1}^{N} \left( I_{il} - E\left[ \varphi^{(k)}\!\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big)\, \hat{s}_l(k) \right] \right) w_{lj}(k), \quad (10) $$
where $I$ is the identity matrix and $\varphi^{(k)}(\cdot)$ is the non-linear score function, mathematically expressed as [34]
$$ \varphi^{(k)}\!\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big) = -\frac{\partial \log q\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big)}{\partial \hat{s}_i(k)}. \quad (11) $$
The non-linear score function retains the dependency between different frequency bins, which is the central idea of the IVA algorithm and plays a vital role in the separation process. Fundamentally, the IVA method [31] uses a multivariate super-Gaussian source prior to model the inter-frequency dependencies across frequency bins, expressed as
$$ q(s_i) \propto \exp\!\left( -\sqrt{ \sum_{k=1}^{K} \left| \frac{\hat{s}_i(k)}{\sigma_i(k)} \right|^2 } \right), \quad (12) $$
where $\sigma_i(k)$ represents the standard deviation of the $i$-th source at the $k$-th frequency block. Applying Equation (11) to the prior in Equation (12) yields
$$ \varphi^{(k)}\!\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big) = \frac{\hat{s}_i(k)}{\sqrt{ \sum_{k=1}^{K} |\hat{s}_i(k)|^2 }}. \quad (13) $$
Equation (13) is the non-linear score function of the fundamental IVA algorithm and captures the inter-frequency dependencies within each source signal. However, the non-linear score function is not unique and depends strongly on the source prior; different source priors can therefore be used to exploit higher-order statistics. The generalized multivariate Gaussian can also serve as a source prior distribution that retains inter-frequency dependencies between frequency blocks. Due to its heavy tails, it exploits higher-order statistical properties of the source signal and can be expressed as [32]
$$ q(s_i) \propto \exp\!\left( -\left[ (s_i - \mu_i)^{T} \Sigma_i^{-1} (s_i - \mu_i) \right]^{\frac{1}{3}} \right). \quad (14) $$
We assume zero mean $\mu_i = 0$ and identity covariance $\Sigma_i = I$. Then, applying Equation (11) to Equation (14), the score function becomes
$$ \varphi^{(k)}\!\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big) = \frac{2\,\hat{s}_i(k)}{3 \left( \sum_{k=1}^{K} |\hat{s}_i(k)|^2 \right)^{\frac{2}{3}}}. \quad (15) $$
In a noisy real-time environment, the non-stationary nature of the observed convolutive mixture means it contains both high- and low-energy components. Hence, it is difficult for a single source prior to model the statistical properties of a non-stationary convolutive mixture. Therefore, a hybrid model is proposed containing multivariate generalized Gaussian and super-Gaussian source priors, which can better model low- and high-amplitude components [34]. The super-Gaussian source prior models the low-energy amplitudes, while the high-energy amplitudes are modeled by the multivariate generalized Gaussian source prior. The weights between these source priors in the hybrid model are adapted based on the energy of the noisy convolutive mixture. The hybrid model can be expressed as
$$ q(s_i) = \begin{cases} f_{GGD}, & \phi \ge 0.5 \\ f_{SGD}, & \phi < 0.5, \end{cases} \quad (16) $$
where $f_{GGD}$ is the multivariate generalized Gaussian source prior distribution and $f_{SGD}$ is the multivariate super-Gaussian source prior distribution. The corresponding non-linear hybrid score function is
$$ \varphi^{(k)}\!\big(\hat{s}_i(1), \ldots, \hat{s}_i(K)\big) = \begin{cases} \dfrac{2\,\hat{s}_i(k)}{3 \left( \sum_{k=1}^{K} |\hat{s}_i(k)|^2 \right)^{\frac{2}{3}}}, & \phi \ge 0.5 \\[2ex] \dfrac{\hat{s}_i(k)}{\sqrt{ \sum_{k=1}^{K} |\hat{s}_i(k)|^2 }}, & \phi < 0.5, \end{cases} \quad (17) $$
where $\phi \in [0, 1]$ is a weighting parameter that depends on the normalized energy of the received noisy convolutive mixture. The weights of the non-linear score functions and $\phi$ are adjusted by the normalized energy of the mixture at every frequency block.
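The piecewise switching between the two score functions can be sketched in NumPy for a single estimated source. This is an illustrative fragment (the function name and interface are ours), not the authors' implementation:

```python
import numpy as np

def hybrid_score(s_hat, phi):
    """Piecewise hybrid score function for one estimated source.

    s_hat : length-K complex vector of the i-th source across frequency bins
    phi   : normalized mixture energy, a scalar in [0, 1]
    """
    energy = np.sum(np.abs(s_hat) ** 2)  # sum_k |s_i(k)|^2
    if phi >= 0.5:
        # high-energy branch: generalized Gaussian score
        return 2.0 * s_hat / (3.0 * energy ** (2.0 / 3.0))
    # low-energy branch: multivariate super-Gaussian score
    return s_hat / np.sqrt(energy)
```

Both branches divide each bin's value by a function of the total cross-bin energy, which is what ties the frequency bins of one source together in the IVA update.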

4. Results and Discussion

This section provides the performance evaluation of the proposed work using the Matlab simulation tool. First, we use real speech signals from the TIMIT database [41] to generate noisy convolutive mixed signals using a simulated room model and then apply the proposed multistage algorithm to see its effectiveness.

4.1. Experimental Setup

We considered 10 source speech signals comprising five female and five male speakers from the TIMIT database [41]. All the source speech signals have the same loudness and a sampling rate of 8 kHz. A Hamming window with a 75% overlap factor was used. A noisy reverberant environment was used to evaluate the separation performance of the proposed multistage approach. For a fair comparison, the methods of [31,32,42] were extended into multistage versions composed of the respective BSS technique and the MMSE filtering mechanism for de-noising, as presented in Figure 1 and Figure 2, respectively. The proposed models were investigated for different parameters such as the signal-to-distortion ratio (SDR) and RT. ΔSDR is defined as the difference between the SDR of the estimated speech signal and the SDR of the speech mixture, i.e., ΔSDR = SDR_desired − SDR_mixture.
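As an illustration of how ΔSDR is computed, the following sketch uses a simplified energy-ratio SDR as a stand-in for the standard BSS-eval metric used in the paper; the function names and the simplified metric are our own:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Illustrative SDR: target energy over distortion energy, in dB.

    The paper uses the standard BSS-eval SDR; this simplified ratio is
    a demonstration stand-in only.
    """
    distortion = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))

def delta_sdr(sdr_estimate, sdr_mixture):
    """Delta-SDR improvement: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_estimate - sdr_mixture
```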

4.2. Objective Evaluation

For the first proposed model shown in Figure 1, the SNR values were varied from −2 to 10 dB. The NFFT length, window size, and RT were set to 1024, 512, and 100 msec, respectively. The results obtained for different input speech mixtures were averaged and are provided in Table 1, which shows that the proposed model improves on the BSS methods with multivariate super-Gaussian [31], multivariate Student's T [42], and generalized Gaussian [32] source priors for the estimated speech signals Ŝ1 and Ŝ2. Next, the RT parameter was varied from 40 to 200 msec, with the window size and NFFT set to 512 and 1024, respectively. In Table 1, the proposed model shows better results at SNR = 4 dB than at the other SNR values; therefore, SNR = 4 dB was used for the RT experiments. Table 2 likewise shows that the proposed model improves on [31,32,42]. The first proposed model performs better because, unlike the single source prior BSS models, it adapts to the non-stationary nature of the observed convolutive mixture, which contains both low- and high-energy components.
For the second proposed model (Figure 2), the same procedure was followed to generate different convolutive speech mixtures comprising two speech signals and white Gaussian noise using a simulated room model [43]. The SNR values were varied from −2 to 10 dB. The results of the different multistage BSS approaches were averaged and are presented in Table 3, which shows a performance gain for the proposed model over [31,32,42] for Ŝ1 and Ŝ2. Similarly, the RT parameter was varied to evaluate the robustness of the proposed model, with the window size, NFFT, and SNR set to 512, 1024, and 4 dB, respectively. The results in Table 4 show a performance improvement of the proposed model over the methods in [31,32,42]. The results of the second model indicate that switching between the two source priors according to the low- and high-energy components in the received convolutive mixture improves the performance over the single source prior BSS models [31,32,42].
From the results of the two proposed models, it is clear that the first model (Figure 1) performs better than the second model (Figure 2). In the first model, the estimated source speech signals are first extracted from the observed noisy convolutive mixture, and the noise in each separated speech signal is then suppressed individually, leading to better performance. In the second model, by contrast, de-noising is applied first to the convolutive mixture, in which the source signals are still mixed together with the noise; the suppression is therefore less targeted, resulting in performance degradation.

4.3. Subjective Evaluation

For the subjective evaluation, listening tests were performed to verify the simulation results reported in Table 5, Table 6, Table 7 and Table 8. Five participants (two female, three male), all with normal hearing, took part in the experiments. Each listener was asked to assign an integer score from 1 (estimated speech signals not audible) to 5 (estimated speech signals clearly audible) to the source speech signals extracted from the noisy convolutive mixture. The listeners heard the original signal and the enhanced speech signals separated from the noisy reverberant mixtures by the two proposed models.
The same speech signals were chosen for both objective evaluation and subjective listening analysis in these experiments. In the multistage model presented in Figure 1, the values of the parameters for window size, NFFT, and RT were set to 512, 1024, and 100 msec. The SNR value was varied from −2 to 10 dBs. The score marked by the participants is based on the cleanness of the extracted signals from the convolutive mixture containing white Gaussian noise. The clean estimated speech signals are marked with a higher mean opinion score (MOS) and vice versa. The results obtained from the listening participants for each extracted signal are averaged and presented in Table 5. In Table 5, the proposed model shows improvement in its MOS in comparison with the multistage BSS having source priors [31,32,42]. Next, the parameter RT was varied from 40 to 200 msec with a fixed window length = 512, NFFT = 1024, and SNR = 4 dB. The averaged results obtained are provided in Table 6. From Table 6, the proposed model shows improvement over the methods in [31,32,42].
In the second multistage model shown in Figure 2, the previous parameter values were used for RT, window length, and FFT frame length. The SNR value was varied from −2 to 10 dB. The averaged MOS results are presented in Table 7, showing the performance improvement of the proposed model compared with the super-Gaussian [31], Student's T [42], and generalized Gaussian [32] source priors. Next, the RT parameter was varied from 40 to 200 msec with a window length of 512, FFT of 1024, and SNR of 4 dB. The averaged MOS results of the proposed model in Table 8 reflect an improvement over [31,32,42].

4.4. Results with Colored Noise

The experiments were also performed in a noisy environment with pink noise for variable SNR and RT for both proposed models, as shown in Figure 3, Figure 4, Figure 5 and Figure 6. For the first proposed model, the SNR values were varied from −2 to 10 dB, with the window length, NFFT frame length, and RT set to 512, 1024, and 100 msec, respectively. The results were obtained from different noisy convolutive mixtures generated from the pool of speech signals used previously. It can be observed in Figure 3 that the proposed model shows improvement for the estimated source signals Ŝ1 and Ŝ2 in comparison with the multistage BSS models with the source priors of [31,32,42]. Next, the RT parameter was varied to evaluate the robustness of the proposed model, with the window length, FFT, and SNR set to 512, 1024, and 4 dB. From Figure 5, it can be concluded that the proposed model improves on the multistage source prior models of [31,32,42].
For the second proposed model, the same procedure was adopted for the generation of different convolutive mixtures with two speech signals and pink noise by a simulated room model [43]. The SNR parameter was varied from −2 to 10 dB, and the results were averaged. From Figure 4, the proposed model shows performance gain for S ^ 1 and S ^ 2 from other multistage BSS methods [31,32,42]. Similarly, the RT parameter was varied from 40 to 200 msec to evaluate the proposed multistage hybrid model. The window length, NFFT, and SNR parameters were set to 512, 1024, and 4 dB. The results reflected in Figure 6 show the performance gain of the proposed model from the literature.

4.5. Energy Distribution of Observed Mixtures

In a noisy real-time environment, the non-stationary observed convolutive mixture contains both high- and low-energy components, as shown in Figure 7 and Figure 8. Figure 7 shows the normalized energy of the different frequency bins contained in observed convolutive mixture 1, and Figure 8 reflects the energy distribution of convolutive mixture 2 across the frequency blocks.

5. Performance Evaluation

In this section, the separation performance of the proposed model is compared with various multistage BSS approaches with different source priors such as multivariate super-Gaussian [31], Student’s T [42], and generalized Gaussian distributions [32]. The two proposed multistage models use the BSS approach to separate the estimated speech signals from the noisy convolutive mixture followed by the MMSE filtering technique to de-noise the signals.
For the performance evaluation of the first model, we generated 20 different noisy convolutive speech mixtures with a simulated room model by randomly selecting speech signals from a pool of 10 source speech signals (five male and five female). We varied the SNR and RT to obtain averaged results, with the SNR ranging from −2 to 10 dB at a window length of 512, NFFT of 1024, and RT of 100 msec. The averaged results, presented in Table 1 in terms of SDR, show that the proposed model gains an enhancement of 0.3 dB for Ŝ1 and 0.5 dB for Ŝ2. Moreover, compared with the literature, the proposed model demonstrates its effectiveness with optimum gains of 0.2 and 1 dB for the estimated speech signals Ŝ1 and Ŝ2, respectively.
Additionally, the RT was varied from 40 to 200 msec with window length = 512, NFFT = 1024, and SNR = 4 dB. The noisy reverberant mixtures were fed to the proposed model and to the other multistage BSS methods [31,32,42]. The average objective results in Table 2 show performance gains for the proposed model of 0.3 and 0.4 dB for Ŝ1 and Ŝ2 over the multivariate super-Gaussian prior [31]. The proposed approach also shows substantial improvements of 3.81 and 2.9 dB for the estimated source signals Ŝ1 and Ŝ2, respectively, over the multistage BSS method with a Student’s t source prior [42]. Compared with [32], the proposed method shows improvements of 0.16 and 0.2 dB for Ŝ1 and Ŝ2, respectively.
For the performance evaluation of the second model, the noisy reverberant mixtures were first de-noised using the MMSE filtering technique; in the second stage, the estimated speech signals were separated from the de-noised mixture using the BSS method. The average objective results with variable SNR are shown in Table 3, where the proposed model achieves a performance improvement of 0.2 dB for both Ŝ1 and Ŝ2 over the other multistage BSS approaches. Table 3 also shows significant improvements of 0.7 and 0.8 dB over the Student’s t method [42], and of 0.1 and 0.4 dB over [32], for Ŝ1 and Ŝ2, respectively. The objective evaluations with variable RT, with window length = 512, NFFT = 1024, and SNR = 4 dB, are provided in Table 4, where the proposed model shows performance gains of 1.7, 2.14, and 0.21 dB for Ŝ1 and of 0.6, 1.63, and 0.4 dB for Ŝ2 over [31], [42], and [32], respectively.
Subjective experiments were also performed to cross-verify the simulations; the window length, NFFT, and RT were set to 512, 1024, and 100 msec, respectively. Five participants were asked to rate the mean opinion score (MOS) of the speech signals estimated by the multistage BSS models with the source priors of [31], [42], and [32] and by the two proposed methods. The average MOS results for the first model are presented in Table 5 and Table 6 for variable SNR and RT, respectively. With variable SNR (Table 5), the proposed approach achieves MOS gains of 0.5, 0.8, and 0.2 for Ŝ1 and of 0.4, 0.6, and 0.12 for Ŝ2 over the other multistage BSS methods, respectively. With variable RT and fixed window length = 512, NFFT = 1024, and SNR = 4 dB, the average MOS values in Table 6 show gains of 0.4, 0.6, and 0.2 for Ŝ1 and of 0.5, 0.7, and 0.2 for Ŝ2. The same procedure was followed to verify the second proposed model, with the results displayed in Table 7 and Table 8. With variable SNR (Table 7), the proposed model achieves MOS gains of 0.4, 0.7, and 0.2 for Ŝ1 and of 0.3, 0.7, and 0.2 for Ŝ2. Similarly, with variable RT (Table 8), the proposed model shows MOS gains of 0.4, 0.6, and 0.2 for Ŝ1 and of 0.2, 0.5, and 0.1 for Ŝ2.

Comparative Analysis of the Proposed Models

A comparative analysis of the two proposed models is presented in Figure 9 and Figure 10 for the estimated speech signals Ŝ1 and Ŝ2. The results in these figures are deduced from the objective evaluations in Table 1, Table 2, Table 3 and Table 4 for variable SNR and RT.
From Figure 9a,b, it is clear that the first model provides significant improvements over the second model for the estimated source signals Ŝ1 and Ŝ2 with variable SNR. Similarly, for variable RT, Figure 10a,b shows that the first model achieves considerable performance gains over the second model for Ŝ1 and Ŝ2. The first model (Figure 1) outperforms the second model (Figure 2) for both RT and SNR because it suppresses the noise in each estimated source signal after that signal has been extracted from the noisy convolutive mixture, whereas in the second model the de-noising technique operates directly on the noisy convolutive mixture, in which the noise and source signals are still mixed together. The de-noising module then treats the other source signals present in the mixture as noise, which degrades performance.

6. Conclusions and Future Work

This paper proposes an efficient hybrid multistage approach for blind source separation (BSS) of noisy convolutive speech mixtures. In the BSS stage, a hybrid source prior consisting of multivariate super-Gaussian and generalized Gaussian distributions is used to model the source signals in the observed noisy reverberant mixture, with the weights between the two priors assigned according to the energy of the observed convolutive mixture. In the de-noising stage, the noise is suppressed by the MMSE filtering technique. Two models are proposed: in the first, the BSS module is followed by the de-noising stage; in the second, the de-noising module is followed by the BSS stage. Both models are compared with the literature, and the results clearly show the performance improvement of the proposed schemes. Furthermore, the results show that the model with the BSS module followed by the de-noising stage achieves a significant gain over the model in which de-noising precedes the BSS stage.
In future work, we will perform experimental measurements in a real-time setup and compare the simulated and practical results in terms of SNR, energy distribution, and error performance.

Author Contributions

The work was developed as a collaboration among all authors. J.B.K., T.J., and R.A.K. designed the study and system development. N.S. and M.A. directed the research and collaborated in discussion on the proposed system model. The manuscript was mainly drafted by J.B.K., T.J., R.A.K. and N.S. and was revised and corrected by all co-authors. All authors have read and approved the final manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/philipperemy/timit (accessed on 20 June 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gupta, V.; Bhowmick, A.; Chandra, M.; Sharan, S. Speech enhancement using MMSE estimation and spectral subtraction methods. In Proceedings of the 2011 International Conference on Devices and Communications (ICDeCom), Mesra, India, 24–25 February 2011; pp. 1–5.
  2. Souden, M.; Araki, S.; Kinoshita, K.; Nakatani, T.; Sawada, H. A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1913–1928.
  3. Enzner, G.; Thüne, P. Robust MMSE filtering for single-microphone speech enhancement. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4009–4013.
  4. Fenghua, Z.; Le, Y.; Jian, W.; Qiang, S. Speech signal enhancement through wavelet domain MMSE filtering. In Proceedings of the 2010 International Conference on Computer, Mechatronics, Control and Electronic Engineering, Changchun, China, 24–26 August 2010; Volume 5, pp. 118–121.
  5. Kirubagari, B.; Palanivel, S.; Subathra, N. Speech enhancement using minimum mean square error filter and spectral subtraction filter. In Proceedings of the International Conference on Information Communication and Embedded Systems (ICICES2014), Chennai, India, 27–28 February 2014; pp. 1–7.
  6. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345.
  7. Khalil, R.; Ashraf, S.; Jan, T.; Jehangir, A.; Khan, J. Enhancement of Speech Signals Using Multiple Statistical Models. Sindh Univ. Res. J. SURJ Sci. Ser. 2015, 47, 519–522.
  8. Yang, J.; Wang, Z. Blind separation algorithm for speech and noise based on diagonalizing second-order statistics accurately. In Proceedings of the 2010 2nd IEEE International Conference on Information Management and Engineering, Chengdu, China, 16–18 April 2010; pp. 370–373.
  9. Hongyan, L.; Guanglong, R. Blind separation of noisy mixed speech signals based Independent Component Analysis. In Proceedings of the 2010 First International Conference on Pervasive Computing, Signal Processing and Applications, Harbin, China, 17–19 September 2010; pp. 586–589.
  10. Yin, J.; Liu, Z.; Jin, Y.; Peng, D.; Kang, J. Blind Source Separation and Identification for Speech Signals. In Proceedings of the 2017 International Conference on Sensing, Diagnostics, Prognostics and Control (SDPC), Shanghai, China, 16–18 August 2017; pp. 398–402.
  11. Rivet, B.; Girin, L.; Jutten, C. Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Trans. Audio Speech Lang. Process. 2006, 15, 96–108.
  12. Parikh, D.N.; Anderson, D.V. Blind source separation with perceptual post processing. In Proceedings of the 2011 Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE), Sedona, AZ, USA, 4–7 January 2011; pp. 321–325.
  13. Rahima, H.; Djebari, M.; Mohamed, D. Blind speech enhancement and acoustic noise reduction by SFTF adaptive algorithm. In Proceedings of the 2017 5th International Conference on Electrical Engineering-Boumerdes (ICEE-B), Boumerdes, Algeria, 29–31 October 2017; pp. 1–4.
  14. Fisli, S.; Djendi, M.; Guessoum, A. Modified predator-prey particle swarm optimization based two-channel speech quality enhancement by forward blind source separation. In Proceedings of the 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria, 25–26 April 2018; pp. 1–6.
  15. Bendoumia, R.; Djendi, M.; Guessoum, A. New symmetric subband forward algorithm based on simple variable step-sizes for speech enhancement. In Proceedings of the 2017 5th International Conference on Electrical Engineering-Boumerdes (ICEE-B), Boumerdes, Algeria, 29–31 October 2017; pp. 1–6.
  16. Bendoumia, R.; Djendi, M. Speech enhancement using backward adaptive filtering algorithm: Variable step-sizes approaches. In Proceedings of the 2015 3rd International Conference on Control, Engineering & Information Technology (CEIT), Tlemcen, Algeria, 25–27 May 2015; pp. 1–5.
  17. Ghribi, K.; Djendi, M.; Berkani, D. Thresholding wavelet-based forward BSS algorithm for speech enhancement and complexity reduction. In Proceedings of the 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria, 25–26 April 2018; pp. 1–6.
  18. Beack, S.K.; Lee, B.; Hahn, M.; Nam, S.H. Blind source separation and Kalman filter-based speech enhancement in a car environment. In Proceedings of the 2004 International Symposium on Intelligent Signal Processing and Communication Systems 2004 (ISPACS 2004), Seoul, Korea, 18–19 November 2004; pp. 520–523.
  19. Wang, Z.Q.; Wang, D. Combining spectral and spatial features for deep learning based blind speaker separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 27, 457–468.
  20. Wang, H.; Bi, A.; Xu, P.; Gao, C. Convolutive Blind Source Separation Algorithm Based on Higher Order Statistics. In Proceedings of the 2013 Third International Conference on Intelligent System Design and Engineering Applications, Hong Kong, China, 16–18 January 2013; pp. 487–490.
  21. Abdessamed, B.; Yahia, B.; Mohamed, D. Hands Free Communication Improvement in Airplane by a New Dual RNQ Adaptive Algorithm. In Proceedings of the 2018 International Conference on Electrical Sciences and Technologies in Maghreb (CISTEM), Algiers, Algeria, 28–31 October 2018; pp. 1–4.
  22. Wang, C.; Zhu, X.; Li, X. Interference Suppression Based on Single-channel Blind Source Separation in Weather Radar. In Proceedings of the 2019 International Conference on Meteorology Observations (ICMO), Chengdu, China, 28–31 December 2019; pp. 1–4.
  23. Cmejla, J.; Koldovsky, Z. Multi-channel speech enhancement based on independent vector extraction. In Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018; pp. 525–529.
  24. Tang, H.; Wang, S. Noisy blind source separation based on adaptive noise removal. In Proceedings of the 10th World Congress on Intelligent Control and Automation, Beijing, China, 6–8 July 2012; pp. 4255–4257.
  25. Routray, A.; Das, N.; Dash, P. Robust preprocessing: Denoising and whitening in the context of blind source separation of instantaneous mixtures. In Proceedings of the 2007 5th IEEE International Conference on Industrial Informatics, Vienna, Austria, 23–27 June 2007; Volume 1, pp. 377–380.
  26. Yang, Y.; Li, Z.; Wang, X.; Zhang, D. Noise source separation based on the blind source separation. In Proceedings of the 2011 Chinese Control and Decision Conference (CCDC), Mianyang, China, 23–25 May 2011; pp. 2236–2240.
  27. Chen, Y. Single channel blind source separation based on nmf and its application to speech enhancement. In Proceedings of the 2017 IEEE 9th International Conference on Communication Software and Networks (ICCSN), Guangzhou, China, 6–8 May 2017; pp. 1066–1069.
  28. Yatabe, K.; Kitamura, D. Time-frequency-masking-based Determined BSS with Application to Sparse IVA. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 715–719.
  29. Schwartz, B.; Gannot, S.; Habets, E.A. Two model-based EM algorithms for blind source separation in noisy environments. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2209–2222.
  30. Wood, S.U.; Rouat, J. Unsupervised low latency speech enhancement with RT-GCC-NMF. IEEE J. Sel. Top. Signal Process. 2019, 13, 332–346.
  31. Kim, T.; Attias, H.T.; Lee, S.Y.; Lee, T.W. Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. Audio Speech Lang. Process. 2006, 15, 70–79.
  32. Liang, Y.; Naqvi, S.M.; Wang, W.; Chambers, J.A. Frequency domain blind source separation based on independent vector analysis with a multivariate generalized Gaussian source prior. In Blind Source Separation; Springer: Raleigh, NC, USA, 2014; pp. 131–150.
  33. Rafique, W.; Erateb, S.; Naqvi, S.M.; Dlay, S.S.; Chambers, J.A. Independent vector analysis for source separation using an energy driven mixed Student’s t and super Gaussian source prior. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 858–862.
  34. Khan, J.B.; Jan, T.; Khalil, R.A.; Altalbe, A. Hybrid Source Prior Based Independent Vector Analysis for Blind Separation of Speech Signals. IEEE Access 2020, 8, 132871–132881.
  35. Soon, Y.; Koh, S.N.; Yeo, C.K. Noisy speech enhancement using discrete cosine transform. Speech Commun. 1998, 24, 249–257.
  36. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121.
  37. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445.
  38. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
  39. McAulay, R.; Malpass, M. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 137–145.
  40. Kim, T.; Lee, I.; Lee, T.W. Independent vector analysis: Definition and algorithms. In Proceedings of the 2006 Fortieth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 29 October–1 November 2006; pp. 1393–1396.
  41. Garofolo, J.S. TIMIT Acoustic Phonetic Continuous Speech Corpus; Linguistic Data Consortium: Philadelphia, PA, USA, 1993.
  42. Rafique, W.; Naqvi, S.M.; Jackson, P.J.; Chambers, J.A. IVA algorithms using a multivariate Student’s t source prior for speech source separation in real room environments. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 474–478.
  43. Allen, J.B.; Berkley, D.A. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
Figure 1. First model.
Figure 2. Second model.
Figure 3. Change in SDR for convolutive pink noisy mixture with variable SNR for the first proposed model.
Figure 4. Change in SDR for convolutive pink noisy mixture with variable SNR for the 2nd proposed model.
Figure 5. Change in SDR for convolutive pink noisy mixture with variable RT for the 1st proposed model.
Figure 6. Change in SDR for convolutive pink noisy mixture with variable RT in the 2nd proposed model.
Figure 7. Normalized energy in different frequency bins of observed mixture 1.
Figure 8. Normalized energy in different frequency bins of observed mixture 2.
Figure 9. (a) Comparison of the two proposed models for Ŝ1 with variable SNR (dB); (b) comparison of the two proposed models for Ŝ2 with variable SNR (dB).
Figure 10. (a) Comparison of the two proposed models for Ŝ1 with variable RT (msec); (b) comparison of the two proposed models for Ŝ2 with variable RT (msec).
Table 1. Average ΔSDR (dB) for the first proposed model shown in Figure 1 with variable SNR, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| SNR (dB) | ΔSDR Ŝ1 [31] | ΔSDR Ŝ2 [31] | ΔSDR Ŝ1 [42] | ΔSDR Ŝ2 [42] | ΔSDR Ŝ1 [32] | ΔSDR Ŝ2 [32] | ΔSDR Ŝ1 (Proposed) | ΔSDR Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| −2 | 10.22 | 6.17 | 8.05 | 4.57 | 10.30 | 6.24 | 10.44 | 6.35 |
| 0 | 9.58 | 5.36 | 7.49 | 4.18 | 9.73 | 5.39 | 9.83 | 5.41 |
| 2 | 9.30 | 5.08 | 5.98 | 2.02 | 9.51 | 5.31 | 9.66 | 5.33 |
| 4 | 8.80 | 3.49 | 5.82 | 1.86 | 8.85 | 5.05 | 9.31 | 5.12 |
| 6 | 8.75 | 3.38 | 5.57 | 1.14 | 8.81 | 3.48 | 8.84 | 3.57 |
| 8 | 8.62 | 2.37 | 5.33 | 1.00 | 8.71 | 3.36 | 8.81 | 3.42 |
| 10 | 8.31 | 2.01 | 5.21 | 0.26 | 8.39 | 2.33 | 8.53 | 2.41 |
Table 2. Average ΔSDR (dB) for the first proposed model shown in Figure 1 with variable RT, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| RT (ms) | ΔSDR Ŝ1 [31] | ΔSDR Ŝ2 [31] | ΔSDR Ŝ1 [42] | ΔSDR Ŝ2 [42] | ΔSDR Ŝ1 [32] | ΔSDR Ŝ2 [32] | ΔSDR Ŝ1 (Proposed) | ΔSDR Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| 40 | 16.10 | 7.47 | 9.64 | 2.27 | 16.29 | 7.56 | 16.37 | 7.68 |
| 80 | 12.45 | 5.79 | 6.94 | 2.00 | 12.65 | 5.72 | 12.81 | 5.87 |
| 120 | 7.06 | 3.02 | 5.92 | 1.65 | 7.26 | 3.22 | 7.41 | 3.39 |
| 160 | 4.49 | 2.11 | 2.88 | 1.04 | 4.63 | 2.63 | 4.83 | 2.85 |
| 200 | 3.92 | 1.62 | 2.25 | 0.28 | 4.01 | 1.81 | 4.23 | 1.97 |
Table 3. Average ΔSDR (dB) for the second proposed model shown in Figure 2 with variable SNR, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| SNR (dB) | ΔSDR Ŝ1 [31] | ΔSDR Ŝ2 [31] | ΔSDR Ŝ1 [42] | ΔSDR Ŝ2 [42] | ΔSDR Ŝ1 [32] | ΔSDR Ŝ2 [32] | ΔSDR Ŝ1 (Proposed) | ΔSDR Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| −2 | 5.76 | 4.48 | 3.79 | 5.39 | 5.81 | 4.67 | 5.87 | 4.51 |
| 0 | 4.37 | 3.74 | 3.44 | 2.86 | 4.39 | 3.38 | 4.53 | 3.87 |
| 2 | 3.86 | 3.34 | 3.37 | 2.56 | 3.90 | 2.68 | 3.96 | 3.52 |
| 4 | 3.45 | 2.83 | 2.80 | 2.13 | 3.57 | 2.41 | 3.61 | 2.85 |
| 6 | 2.34 | 2.18 | 2.37 | 1.02 | 2.40 | 1.95 | 2.43 | 2.47 |
| 8 | 1.79 | 1.49 | 1.40 | 0.35 | 1.90 | 1.36 | 2.05 | 1.70 |
| 10 | 0.70 | 1.19 | 0.63 | 0.11 | 0.81 | 1.22 | 1.02 | 1.45 |
Table 4. Average ΔSDR (dB) for the second proposed model shown in Figure 2 with variable RT, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| RT (ms) | ΔSDR Ŝ1 [31] | ΔSDR Ŝ2 [31] | ΔSDR Ŝ1 [42] | ΔSDR Ŝ2 [42] | ΔSDR Ŝ1 [32] | ΔSDR Ŝ2 [32] | ΔSDR Ŝ1 (Proposed) | ΔSDR Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| 40 | 10.86 | 6.32 | 6.30 | 2.50 | 10.85 | 6.35 | 10.97 | 6.43 |
| 80 | 2.38 | 2.52 | 4.34 | 2.07 | 4.55 | 2.62 | 4.87 | 3.14 |
| 120 | 2.11 | 1.81 | 3.23 | 1.97 | 4.09 | 2.12 | 4.29 | 2.53 |
| 160 | 1.72 | 1.56 | 1.19 | 1.16 | 3.82 | 2.06 | 3.91 | 2.39 |
| 200 | 1.40 | 1.19 | 1.05 | 0.38 | 2.46 | 1.24 | 2.76 | 1.74 |
Table 5. Average MOS results of the subjective evaluation for the first model shown in Figure 1 with variable SNR, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| SNR (dB) | MOS Ŝ1 [31] | MOS Ŝ2 [31] | MOS Ŝ1 [42] | MOS Ŝ2 [42] | MOS Ŝ1 [32] | MOS Ŝ2 [32] | MOS Ŝ1 (Proposed) | MOS Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| −2 | 1.57 | 1.71 | 1.45 | 1.55 | 1.72 | 1.81 | 2.01 | 1.96 |
| 0 | 2.13 | 2.37 | 1.98 | 2.15 | 2.45 | 2.62 | 2.83 | 2.77 |
| 2 | 2.57 | 2.62 | 2.34 | 2.44 | 2.88 | 2.76 | 3.21 | 2.99 |
| 4 | 3.12 | 2.87 | 2.73 | 2.67 | 3.56 | 3.22 | 3.87 | 3.58 |
| 6 | 3.95 | 3.25 | 3.46 | 3.11 | 4.17 | 3.49 | 4.21 | 3.67 |
| 8 | 4.37 | 3.63 | 3.88 | 3.34 | 4.42 | 3.86 | 4.53 | 3.93 |
| 10 | 4.46 | 4.13 | 4.13 | 3.96 | 4.61 | 4.58 | 4.69 | 4.26 |
Table 6. Average MOS results of the subjective evaluation for the first model shown in Figure 1 with variable RT, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| RT (ms) | MOS Ŝ1 [31] | MOS Ŝ2 [31] | MOS Ŝ1 [42] | MOS Ŝ2 [42] | MOS Ŝ1 [32] | MOS Ŝ2 [32] | MOS Ŝ1 (Proposed) | MOS Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| 40 | 3.94 | 4.01 | 3.86 | 3.70 | 4.10 | 4.23 | 4.34 | 4.67 |
| 80 | 3.72 | 3.79 | 3.52 | 3.39 | 3.94 | 3.86 | 4.21 | 4.18 |
| 120 | 3.18 | 2.75 | 2.63 | 2.57 | 3.49 | 3.18 | 3.68 | 3.44 |
| 160 | 2.81 | 2.58 | 2.46 | 2.43 | 2.95 | 2.87 | 3.03 | 2.96 |
| 200 | 2.34 | 2.25 | 2.17 | 2.08 | 2.46 | 2.37 | 2.58 | 2.47 |
Table 7. Average MOS results of the subjective evaluation for the second model shown in Figure 2 with variable SNR, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| SNR (dB) | MOS Ŝ1 [31] | MOS Ŝ2 [31] | MOS Ŝ1 [42] | MOS Ŝ2 [42] | MOS Ŝ1 [32] | MOS Ŝ2 [32] | MOS Ŝ1 (Proposed) | MOS Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| −2 | 1.23 | 1.27 | 1.07 | 1.24 | 1.37 | 1.30 | 1.55 | 1.48 |
| 0 | 1.91 | 2.08 | 1.74 | 1.63 | 2.21 | 2.26 | 2.43 | 2.39 |
| 2 | 2.24 | 2.31 | 1.98 | 1.78 | 2.53 | 2.49 | 2.72 | 2.58 |
| 4 | 2.81 | 2.67 | 2.46 | 2.27 | 2.98 | 2.81 | 3.15 | 3.04 |
| 6 | 3.22 | 2.91 | 2.83 | 2.71 | 3.45 | 3.22 | 3.59 | 3.45 |
| 8 | 3.39 | 3.21 | 2.96 | 2.88 | 3.62 | 3.47 | 3.82 | 3.51 |
| 10 | 3.54 | 3.45 | 3.20 | 3.13 | 3.79 | 3.55 | 3.92 | 3.68 |
Table 8. Average MOS results of the subjective evaluation for the second model shown in Figure 2 with variable RT, compared with multistage BSS models having different source priors: multivariate Gaussian [31], Student’s t distribution [42], and generalized Gaussian [32].

| RT (ms) | MOS Ŝ1 [31] | MOS Ŝ2 [31] | MOS Ŝ1 [42] | MOS Ŝ2 [42] | MOS Ŝ1 [32] | MOS Ŝ2 [32] | MOS Ŝ1 (Proposed) | MOS Ŝ2 (Proposed) |
|---|---|---|---|---|---|---|---|---|
| 40 | 3.39 | 3.52 | 3.43 | 3.51 | 3.61 | 3.55 | 3.89 | 3.63 |
| 80 | 3.19 | 3.35 | 3.05 | 3.16 | 3.35 | 3.48 | 3.55 | 3.52 |
| 120 | 2.79 | 2.58 | 2.38 | 2.18 | 2.91 | 2.67 | 3.02 | 2.88 |
| 160 | 2.51 | 2.35 | 2.23 | 1.92 | 2.73 | 2.46 | 2.95 | 2.54 |
| 200 | 2.36 | 2.14 | 1.89 | 1.67 | 2.51 | 2.33 | 2.68 | 2.39 |
Share and Cite

Khan, J.B.; Jan, T.; Khalil, R.A.; Saeed, N.; Almutiry, M. An Efficient Multistage Approach for Blind Source Separation of Noisy Convolutive Speech Mixture. Appl. Sci. 2021, 11, 5968. https://doi.org/10.3390/app11135968
