1. Introduction
Speech enhancement is essential to ensure satisfactory perceptual quality and intelligibility of speech signals in many speech applications, such as hearing aids and speech communication with mobile phones and hands-free systems [1–43]. Devices with multiple microphones have become widespread, which has enabled multi-microphone speech enhancement that exploits spatial information as well as the spectro-temporal characteristics of the input signals [6–48]. One of the most popular approaches to multi-microphone speech enhancement is spatial filtering in the time–frequency domain, which aims to extract a target speech signal from multiple microphone signals contaminated by background noise and reverberation by suppressing sounds arriving from directions other than the target direction [6–11].
There have been various types of spatial filters with different optimization criteria [6–10]. Among them, the minimum mean square error (MMSE) criterion for speech spectra estimation led to the multi-channel Wiener filter (MWF), which has shown decent performance [10,12,21,22]. It has been shown that the MWF solution can be decomposed into the concatenation of the minimum-variance distortionless-response (MVDR) beamformer and a single-channel postfilter [11,12]. Spatial filters often require acoustic parameters such as the relative transfer function (RTF) between the microphones and the speech and noise spatial covariance matrices (SCMs), which should be estimated from the noisy observations.
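For concreteness, a standard form of this decomposition can be written as follows; the notation is introduced here for illustration (speech PSD $\phi_{S}$, RTF vector $\mathbf{g}$ relative to the reference microphone, noise SCM $\boldsymbol{\Phi}_{V}$) and may differ from that of [11,12]:
$$\mathbf{h}_{\mathrm{MWF}}=\underbrace{\frac{\boldsymbol{\Phi}_{V}^{-1}\mathbf{g}}{\mathbf{g}^{H}\boldsymbol{\Phi}_{V}^{-1}\mathbf{g}}}_{\text{MVDR beamformer}}\cdot\underbrace{\frac{\xi}{\xi+1}}_{\text{Wiener postfilter}},\qquad \xi=\phi_{S}\,\mathbf{g}^{H}\boldsymbol{\Phi}_{V}^{-1}\mathbf{g},$$
where $\xi$ is the a priori SNR at the beamformer output.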
For applications such as speech communication and hearing aids, low latency is crucial, and thus an online algorithm is required for multi-microphone speech enhancement. The work in [25] extended the single-channel minima controlled recursive averaging (MCRA) framework [49,50] for noise estimation to the multi-channel case by introducing the multi-channel speech presence probability (SPP) [24]. In [26], a coherent-to-diffuse ratio (CDR) based a priori SPP estimator under the expectation–maximization (EM) framework was proposed to improve robustness in nonstationary noise scenarios. In [25,26], the speech SCM was estimated with the maximum likelihood (ML) approach, while a multi-channel decision-directed (DD) estimator was proposed in [29]. In [27], the recursive EM (REM) algorithm, which performs an iterative estimation of the latent variables and model parameters in the current frame, was exploited by defining an exponentially weighted log-likelihood of the data sequence. The speech SCM was decomposed into the speech power spectral density (PSD) and the RTF under the rank-1 approximation, and these components were estimated by an ML approach using the EM algorithm in [27].
In this paper, we propose an improved speech SCM estimation for online multi-microphone speech enhancement. First, we adopt the temporal cepstrum smoothing (TCS) approach [51] to estimate the speech PSD, which has not yet been tried in multi-channel cases. Furthermore, we propose an RTF estimator based on time difference of arrival (TDoA) estimation using the cross-correlation method. Finally, we propose refining the acoustic parameters by exploiting the clean speech spectrum and clean speech power spectrum estimated in the first pass. The experimental results show that the proposed speech enhancement framework achieved improved performance in terms of perceptual evaluation of speech quality (PESQ) scores, extended short-time objective intelligibility (eSTOI), and scale-invariant signal-to-distortion ratio (SISDR) on the CHiME-4 database. Additionally, we performed an ablation study to understand how each sub-module contributed to the performance improvement.
The remainder of this paper is organized as follows. Section 2 briefly introduces previous work on multi-microphone speech enhancement according to the various classes of approaches and then summarizes the main contributions of our proposal. Section 3 reviews the previous MMSE multi-channel speech enhancement approach and explains the conventional speech and noise SCM estimation. Section 4 presents the proposed speech SCM estimation based on the novel speech PSD and RTF estimators. Section 5 outlines the experimental results, which demonstrate the superiority of the proposed method compared with the baseline in terms of speech quality and intelligibility. Finally, conclusions are drawn in Section 6.
2. Previous Work and Contributions
Recently, many approaches to multi-microphone speech enhancement have been proposed. In [33], the estimation of the speech PSDs reduces to seeking a unitary matrix and the square roots of the PSDs based on a factorization of the speech SCM, and the RTF estimate was recursively updated based on these estimates. The authors also proposed desmoothing the generalized eigenvalues to maintain the non-stationarity of the estimated PSDs. These parameter estimates were subsequently exploited in a Kalman filter-based speech separation algorithm [35]. In the context of sound field analysis, ref. [34] proposed a masking scheme under a non-negative tensor factorization model, and [36] exploited sparse representations in the spherical harmonic domain. The work in [37] proposed a multi-channel non-negative factorization algorithm in the ray space transform domain.
Deep-learning-based approaches have also been proposed, which can be categorized into several types. One is the combination of deep learning with conventional beamforming methods, in which deep neural networks (DNNs) are employed to implement beamforming [38,39]. In [38], a complex spectral mapping approach was proposed to estimate the speech and noise SCMs. In contrast, ref. [39] reformulated the MVDR beamformer in a factorized form associated with two complex components and estimated these components using a DNN, instead of estimating the parameters of the MVDR beamformer directly. Another type is neural beamforming, in which a DNN directly learns the relationship between the multiple noisy inputs and the output in an end-to-end manner [40–43]. In [40], the authors defined spatial regions and proposed a non-linear filter that suppresses signals from the undesired regions while preserving signals from the desired region. In [41], the authors proposed an end-to-end system that estimates the time-domain filter-and-sum beamformer coefficients using a DNN; this explicit filtering was later replaced with implicit filtering in a latent space [42]. In [43], the authors built a causal neural filter comprising modules for fixed beamforming, beam filtering, and residual refinement in the beamspace domain.
One popular approach that adapts the spatial filter to dynamic acoustic conditions is the informed filter, which is computed by utilizing instantaneous acoustic parametric information [15–18]. Refs. [15,16] exploited instantaneous direction of arrival (DoA) estimates to find the time-varying RTF used to construct the spatial filter, and [18] formulated a Bayesian framework under DoA uncertainty. In [19], an eigenvector decomposition was applied to the estimated speech SCM to extract the steering vectors used for the MVDR beamformer. The aforementioned approaches often adopted classical techniques such as ESPRIT [52] or MUSIC [53] for DoA estimation, which may be improved by incorporating more sophisticated sound localization methods [47,48].
Another set of studies focuses on the estimation of the acoustic parameters. An EM algorithm [14] was employed to perform a joint estimation of the signals and the acoustic parameters: the clean speech signals were obtained in the E-step, while the PSDs of the signals, the RTF, and the SCMs were estimated in the M-step. As this EM algorithm processed all of the signal samples at once, REM algorithms [27,28] overcame this limitation by carrying out frame-wise iterative processing to handle online scenarios. For speech PSD estimation, ref. [32] proposed an instantaneous PSD estimation method based on generalized principal components to preserve the non-stationarity of speech signals. For RTF estimation, previous approaches mainly exploited the sample SCMs [46]. The covariance subtraction (CS) approaches [44,45] estimate the RTF by taking the normalized first column of the SCM obtained by subtracting the noise SCM from the noisy speech SCM, assuming that the speech SCM has rank one. On the other hand, the covariance whitening (CW) approaches [30,54] obtain the RTF by normalizing the dewhitened principal eigenvector of the whitened noisy input SCM; a sketch of both estimators is given below.
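To make the distinction concrete, the following sketch implements both estimators for a single frequency bin. It is a minimal illustration under stated assumptions (sample SCMs as inputs, the first microphone as reference), and the function names are ours, not those of the cited works:

```python
import numpy as np

def rtf_cs(Phi_yy, Phi_vv):
    """Covariance subtraction [44,45]: normalize the first column of the
    speech SCM estimate, relying on the rank-1 assumption."""
    col = (Phi_yy - Phi_vv)[:, 0]
    return col / col[0]

def rtf_cw(Phi_yy, Phi_vv):
    """Covariance whitening [30,54]: take the principal eigenvector of the
    whitened noisy SCM, dewhiten it, and normalize to the reference mic."""
    L = np.linalg.cholesky(Phi_vv)            # Phi_vv = L L^H
    Linv = np.linalg.inv(L)
    Phi_w = Linv @ Phi_yy @ Linv.conj().T     # whitened noisy input SCM
    _, vecs = np.linalg.eigh(Phi_w)           # eigenvalues in ascending order
    g = L @ vecs[:, -1]                       # dewhitened principal eigenvector
    return g / g[0]
```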
In this paper, we propose an improved speech SCM estimation method for an online multi-microphone speech enhancement system based on the MVDR beamformer–Wiener filter factorization. The main contributions of our proposal are as follows:
- A speech PSD estimator based on the TCS scheme, which takes knowledge of the speech signal in the cepstral domain into account;
- An RTF estimator based on the TDoA estimate, which takes advantage of the information from all frequency bins, especially when the signal-to-noise ratio (SNR) is low;
- The refinement of the acoustic parameter estimates by exploiting the clean speech spectrum and clean speech power spectrum estimated in the first pass.
4. Proposed Speech SCM Estimation
Figure 2 illustrates the block diagram of the proposed speech enhancement system. As in [25,26,27], the estimation of the speech and the relevant statistical parameters is performed twice for each frame, which was shown to be effective for online speech enhancement. In this paper, we propose an improved method for speech SCM estimation, i.e., speech PSD estimation and RTF estimation under a rank-1 approximation, within the speech enhancement system depicted in Figure 2. Note that the proposed modules are highlighted with red boxes.
In the first pass, we exploit the noisy input of the current frame and the noise SCM estimate obtained in the previous frame to estimate the acoustic parameters of the current frame and to perform beamforming and postfiltering, as explained in Section 3.2.
The ML estimate of the speech PSD at the first microphone, based on an instantaneous estimate of the PSD of the noisy input signal, can be obtained as
$$\hat{\phi}^{\mathrm{ML}}_{S}=\max\!\left(|Y_{1}|^{2}-\big[\hat{\boldsymbol{\Phi}}_{V}\big]_{11},\ \phi_{\min}\right),\qquad(26)$$
where $Y_{1}$ denotes the noisy spectrum at the first microphone, $[\hat{\boldsymbol{\Phi}}_{V}]_{11}$ is the $(1,1)$th component of the noise SCM estimate $\hat{\boldsymbol{\Phi}}_{V}$, and $\phi_{\min}$ is a minimum value for the speech PSD estimate, which is set with a tunable parameter. To estimate the speech PSDs, ML estimation with temporal smoothing has been commonly used, as described in (16) and (17) [25,26,27]. However, this approach occasionally results in undesired temporal smearing of speech [51]. In this paper, we propose to apply TCS [51] to $\hat{\phi}^{\mathrm{ML}}_{S}$ in (26). TCS is a selective temporal smoothing technique in the cepstral domain, motivated by the observation that, although the excitation component resides in a limited number of cepstral coefficients dependent on the pitch frequency, the speech spectral envelope is well represented by the cepstral bins with low indices [55]. Specifically, TCS consists of the following procedure: First, the cepstrum of the ML speech PSD estimate is computed by applying the inverse discrete Fourier transform (IDFT) to its logarithm. Next, selective smoothing is applied to the cepstral coefficients, in which the cepstral bins that are less relevant to speech are smoothed more strongly, while those representing the spectral envelope and the fundamental frequency are smoothed less. Finally, the discrete Fourier transform converts the smoothed cepstrum back into the TCS-based speech PSD estimate in the spectral domain. The bias compensation for the variance reduction caused by cepstral smoothing can be found in [56], and a detailed description of the adaptation of the smoothing parameters and of the fundamental frequency estimation is given in [51]. In this paper, we denote the aforementioned TCS procedure as an operation:
$$\hat{\phi}^{(1)}_{S}=\mathrm{TCS}\!\left(\hat{\phi}^{\mathrm{ML}}_{S}\right),\qquad(27)$$
in which the superscript $(1)$ indicates that this is the estimate in the first pass.
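To make the TCS operation concrete, the following sketch implements one frame of the selective cepstral smoothing described above. It is a minimal illustration rather than the implementation of [51]: the function name, the smoothing constants, and the number of envelope bins are assumptions, and the bias compensation of [56] and the parameter adaptation of [51] are omitted.

```python
import numpy as np

def tcs_smooth(phi_ml, cep_prev, pitch_bin=None, n_env=8,
               alpha_env=0.2, alpha_pitch=0.4, alpha_rest=0.95):
    """One frame of temporal cepstrum smoothing (TCS), a minimal sketch.

    phi_ml    : (K,) instantaneous (ML) speech PSD estimate, full DFT grid
    cep_prev  : (K,) smoothed cepstrum from the previous frame (zeros at start)
    pitch_bin : cepstral index of the fundamental period, if the frame is voiced
    """
    # Cepstrum of the ML PSD estimate: IDFT of its logarithm
    cep = np.real(np.fft.ifft(np.log(np.maximum(phi_ml, 1e-12))))
    # Quefrency-dependent recursive smoothing: envelope and pitch bins follow
    # the instantaneous estimate closely; the other bins are smoothed heavily
    alpha = np.full(len(cep), alpha_rest)
    alpha[:n_env] = alpha_env                  # spectral-envelope bins
    alpha[-(n_env - 1):] = alpha_env           # their conjugate-symmetric mirror
    if pitch_bin is not None:
        alpha[pitch_bin] = alpha[-pitch_bin] = alpha_pitch
    cep_smooth = alpha * cep_prev + (1.0 - alpha) * cep
    # Back to the spectral domain (bias compensation [56] omitted here)
    phi_tcs = np.exp(np.real(np.fft.fft(cep_smooth)))
    return phi_tcs, cep_smooth
```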
In this paper, we model the RTF vector $\mathbf{g}$ as a relative array propagation vector, which depends on the DoA [16]. Note that the conventional approaches in [27,44] estimate the RTF for each frequency using the input statistics of that frequency bin, ignoring inter-frequency dependencies. In the presence of heavy noise, accurate estimation of the RTF may become difficult, and it is therefore beneficial to estimate the TDoA by utilizing the input signal in all frequency bins and to reconstruct the RTF using this simple model. The TDoA for the desired speech can be obtained from the estimate of the cross-PSD of the desired speech using the cross-correlation method [57]. The TDoA estimate $\hat{\tau}_{m}$ between the first and the $m$th microphones is given by
$$\hat{\tau}_{m}=\arg\max_{\tau}\ \mathrm{IDFT}\big\{\hat{\phi}_{X,1m}\big\}(\tau),\qquad(28)$$
in which $\hat{\phi}_{X,1m}$ is the estimate of the cross-PSD of the desired speech between the first and the $m$th microphones. Then, the TDoA-based RTF estimator can be obtained as
$$\hat{g}_{m}(k)=e^{-j2\pi k\hat{\tau}_{m}/K},\qquad(29)$$
where $K$ is the DFT size.
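A compact sketch of (28) and (29) is given below; the function name, the integer-lag search, and the lag range are illustrative assumptions.

```python
import numpy as np

def tdoa_based_rtf(cross_psd, max_lag):
    """TDoA estimation and RTF reconstruction between mic 1 and mic m,
    a minimal sketch of (28) and (29).

    cross_psd : (K,) cross-PSD estimate of the desired speech, full DFT grid
    max_lag   : largest physically plausible delay in samples,
                e.g., mic spacing / speed of sound * sampling rate
    """
    K = len(cross_psd)
    # Cross-correlation method [57]: inverse DFT of the cross-PSD,
    # searched over the plausible lag range (negative lags wrap around)
    xcorr = np.real(np.fft.ifft(cross_psd))
    lags = np.arange(-max_lag, max_lag + 1)
    tau = lags[np.argmax(xcorr[lags])]
    # Relative array propagation model: a pure delay shared by all bins
    k = np.arange(K)
    rtf_m = np.exp(-2j * np.pi * k * tau / K)
    return tau, rtf_m
```

Because the single delay $\hat{\tau}_{m}$ is estimated from all frequency bins jointly, the reconstructed RTF remains stable even when individual bins have low SNR; sub-sample accuracy could be obtained by interpolating the correlation peak.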
In the first pass, the cross-PSD estimate $\hat{\phi}^{(1)}_{X,1m}$ can be obtained by taking the $(1,m)$th element of the ML speech SCM estimate in (16) as
$$\hat{\phi}^{(1)}_{X,1m}=\mathbf{e}_{1}^{T}\,\hat{\boldsymbol{\Phi}}^{\mathrm{ML}}_{X}\,\mathbf{e}_{m},\qquad(30)$$
where $\mathbf{e}_{m}=[\mathbf{0}_{m-1}^{T}\ 1\ \mathbf{0}_{M-m}^{T}]^{T}$, in which $\mathbf{0}_{n}$ is an all-zero vector of length $n$. The first-pass RTF estimate can be computed using (28) and (29) with $\hat{\phi}^{(1)}_{X,1m}$, and the speech SCM can be obtained as in (18) using the TCS-based speech PSD estimate in (27) and this RTF estimate. The noise SCM is estimated with the multi-channel MCRA approach in (11), utilizing the SPP in (15) computed with the current speech and noise SCM estimates. Then, we can compute the beamformer output Z in (6) and the postfilter gain in (9), and the estimate of the speech spectrum can be obtained as in (10).
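The beamforming and postfiltering steps referenced in (6), (9), and (10) can be sketched generically as follows; this is a textbook MVDR–Wiener implementation for a single time–frequency bin with assumed variable names, not the exact formulation of this paper.

```python
import numpy as np

def mvdr_wiener(y, g, Phi_vv, phi_s, eps=1e-10):
    """MVDR beamformer followed by a single-channel Wiener postfilter.

    y      : (M,) noisy STFT observation vector at one TF bin
    g      : (M,) RTF estimate with respect to the first microphone
    Phi_vv : (M, M) noise SCM estimate
    phi_s  : speech PSD estimate at the first microphone
    """
    Phi_inv_g = np.linalg.solve(Phi_vv, g)
    denom = np.real(np.conj(g) @ Phi_inv_g) + eps
    w = Phi_inv_g / denom              # MVDR weights (distortionless toward g)
    z = np.conj(w) @ y                 # beamformer output Z
    xi = phi_s * denom                 # a priori SNR at the beamformer output
    gain = xi / (xi + 1.0)             # Wiener postfilter gain
    return gain * z                    # clean speech spectrum estimate
```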
In the second pass, we estimate the acoustic parameters again by additionally utilizing the estimates for the clean speech spectrum, clean speech power spectrum, and a posteriori SPP, computed in the first pass. These refined parameters are in turn used to estimate the clean speech once again.
To refine the estimate of the speech PSD, we apply the TCS to the clean speech power spectrum estimate $\widehat{|X|^{2}}$ in (24) as
$$\hat{\phi}^{(2)}_{S}=\mathrm{TCS}\!\left(\widehat{|X|^{2}}\right),\qquad(31)$$
in which the superscript $(2)$ indicates that it is the refined estimate in the second pass. As $\widehat{|X|^{2}}$ is less affected by the noise than the noisy input PSD by virtue of beamforming and the MMSE estimation, $\hat{\phi}^{(2)}_{S}$ is expected to be more accurate than $\hat{\phi}^{(1)}_{S}$. As for the RTF estimation, the speech SCM in (22) is evaluated with the clean speech spectrum estimate in (10), as in [27]. Instead of obtaining the RTF by dividing the first column of this SCM by the speech PSD estimate at the first microphone, as in [27], we again estimate the RTF based on the TDoA; the second-pass cross-PSD estimate $\hat{\phi}^{(2)}_{X,1m}$ can be computed by extracting the $m$th element of the first row of the speech SCM in (22) as
$$\hat{\phi}^{(2)}_{X,1m}=\mathbf{e}_{1}^{T}\,\hat{\boldsymbol{\Phi}}_{X}\,\mathbf{e}_{m},\qquad(32)$$
in contrast to (30), where $\hat{\boldsymbol{\Phi}}_{X}$ denotes the speech SCM evaluated via (22). The TDoA-based RTF estimate in the second pass, $\hat{\mathbf{g}}^{(2)}$, can be obtained through (28) and (29) with $\hat{\phi}^{(2)}_{X,1m}$. As in the first pass, the speech SCM is computed with $\hat{\phi}^{(2)}_{S}$ in (31) and $\hat{\mathbf{g}}^{(2)}$ as in (18), and the SPP in (15) is updated with it. Then, the SPP and the noise SCM are obtained again using (15) and (11), and the beamformer output Z and the postfilter gain are updated using (6) and (9). The final clean speech estimate is obtained by (10) using these updated quantities. The whole procedure of the proposed online multi-microphone speech enhancement method is summarized in Algorithm 1.
Algorithm 1 Proposed multi-microphone speech enhancement algorithm with improved speech SCM estimation.
1: Input: the noisy observation y for all frames
2: Output: the clean speech estimate for all frames
3: Initialize variables and parameters
4: for each frame do
5:  Compute the a priori SPP using the CDR-based [26] or DNN-based [27] method
6:  (First pass)
7:  Compute the TCS-based speech PSD estimate via (26) and (27)
8:  Compute the TDoA-based RTF estimate via (16) and (28)–(30)
9:  Estimate the speech and noise SCMs via (11), (15), and (18)
10:  Beamformer: compute Z and the postfilter gain via (6) and (9)
11:  Postfilter: compute the speech spectrum estimate via (10)
12:  (Second pass)
13:  Compute the refined speech PSD estimate via (24) and (31)
14:  Compute the refined RTF estimate via (22), (28), (29), and (32)
15:  Estimate the speech and noise SCMs via (11), (15), and (18)
16:  Beamformer: update Z and the postfilter gain via (6) and (9)
17:  Postfilter: compute the final speech spectrum estimate via (10)
18: end for
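To show how the pieces fit together, the following structural sketch strings the earlier snippets (tcs_smooth, tdoa_based_rtf, mvdr_wiener) into the two-pass loop of Algorithm 1. The SCM and SPP updates here are simplified SPP-weighted recursive averages standing in for (11), (15), (18), and (22), and the second-pass cross-PSD stand-in (first-pass output times the conjugate microphone spectra) is our assumption; the exact forms are given by the referenced equations.

```python
import numpy as np

def enhance_frame(Y, state, spp, max_lag, alpha_v=0.92):
    """One frame of the two-pass procedure (structural sketch only).

    Y     : (M, K) noisy STFT frame (M microphones, K DFT bins)
    state : dict with 'Phi_vv' (K, M, M) noise SCM and 'cep' cepstrum memory
    spp   : (K,) a priori SPP, CDR-based [26] or DNN-based [27]
    """
    M, K = Y.shape
    Phi_vv = state['Phi_vv']
    Phi_yy = np.einsum('mk,nk->kmn', Y, np.conj(Y))     # instantaneous noisy SCM
    # --- First pass ---
    Phi_x_ml = Phi_yy - Phi_vv                          # ML speech SCM, cf. (16)
    phi_ml = np.maximum(np.real(Phi_x_ml[:, 0, 0]), 1e-12)         # cf. (26)
    phi_s1, cep1 = tcs_smooth(phi_ml, state['cep'])                # (27)
    g1 = np.ones((M, K), dtype=complex)
    for m in range(1, M):                               # (28)-(30) per mic pair
        _, g1[m] = tdoa_based_rtf(Phi_x_ml[:, 0, m], max_lag)
    X1 = np.array([mvdr_wiener(Y[:, k], g1[:, k], Phi_vv[k], phi_s1[k])
                   for k in range(K)])                  # (6), (9), (10)
    # --- Second pass ---
    phi_s2, cep2 = tcs_smooth(np.maximum(np.abs(X1) ** 2, 1e-12), cep1)  # (31)
    g2 = np.ones((M, K), dtype=complex)
    for m in range(1, M):                               # (28), (29), (32)
        _, g2[m] = tdoa_based_rtf(X1 * np.conj(Y[m]), max_lag)
    # MCRA-like noise SCM update weighted by speech absence, cf. (11), (15)
    w = ((1.0 - spp) * (1.0 - alpha_v))[:, None, None]
    state['Phi_vv'] = Phi_vv + w * (Phi_yy - Phi_vv)
    X2 = np.array([mvdr_wiener(Y[:, k], g2[:, k], state['Phi_vv'][k], phi_s2[k])
                   for k in range(K)])                  # final estimate, (10)
    state['cep'] = cep2
    return X2, state
```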