1. Introduction
Compressive sampling (CS) aims to recover a sparse signal from far fewer samples than required by the Nyquist–Shannon sampling theorem [1]. The CS procedure can be mathematically expressed as:

y = Φx + e,

where x ∈ R^N is the sparse signal, y ∈ R^M is the measurement vector with M ≪ N, Φ ∈ R^{M×N} is the sensing matrix, and e is the additive noise. A popular choice of the sensing matrix is a random matrix such as a Gaussian matrix. Many CS methods, such as reweighted ℓ1 minimization [2], orthogonal matching pursuit [3], the iterative shrinkage-thresholding algorithm (ISTA) [4] and Bayesian CS [5], can successfully recover a sparse signal from its CS samples under certain conditions. In fact, many natural signals are sparse or compressible when represented in a proper basis, which gives CS methods wide applications in signal sampling and processing. Speech signals are nearly compressible in certain conventional bases, such as the fast Fourier transform, the modified discrete cosine transform (MDCT) and the time–frequency domain [6], so in theory they should be exactly reconstructed from their CS samples. In practice, however, traditional CS methods cannot achieve satisfactory performance when applied to speech signals.
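As an illustration of how such iterative solvers operate, the following is a minimal NumPy sketch of ISTA for the standard ℓ1-regularized least-squares problem. The matrix sizes, sparsity level and regularization weight here are illustrative assumptions, not settings used in this paper:

```python
import numpy as np

def soft(v, thr):
    """Soft-thresholding operator, the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def ista(A, y, lam=0.05, n_iter=200):
    """Minimize 0.5*||y - Ax||^2 + lam*||x||_1 by iterative shrinkage-thresholding."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(1)
N, M, K = 128, 64, 5                           # signal length, samples, nonzeros
A = rng.normal(size=(M, N)) / np.sqrt(M)       # Gaussian sensing matrix
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.normal(size=K)
y = A @ x_true                                 # noiseless CS samples
x_hat = ista(A, y)
```

With enough measurements and a sufficiently sparse signal, `x_hat` closely matches `x_true`; the soft-thresholding step is what enforces sparsity at each iteration.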
The reason lies in the fact that speech is a compound signal and is only nearly compressible. In fact, phoneticists usually divide speech into two categories: voiced sound and unvoiced sound. Voiced sound (e.g., a vowel) is produced by the vibration of the vocal cords, and its spectrum contains only the fundamental tone and its sparse harmonics. Unvoiced sound (e.g., frication) is produced by air turbulence within the human vocal tract; its spectrum is very similar to white noise and has low sparsity. Therefore, it is difficult to recover speech from its CS samples. To solve this problem, the structure of the speech spectrogram can be exploited.
In this paper, we propose a CS recovery method which combines approximate message passing (AMP) [7] and a Markov chain to exploit the correlation within and between speech frames. Firstly, we employ a Bernoulli–Gaussian mixture to model the marginal distribution of each MDCT coefficient, which is determined by two hidden variables, support and amplitude. The support is the index of the Gaussian function. We assume that the variation of the supports can be characterized by a first-order Markov chain. Secondly, we build a factor graph to conveniently describe the structure of the coefficient matrix and the CS procedure. By means of message passing on the factor graph [8], we can infer the distribution of each MDCT coefficient from the CS samples. The inference process can be described as follows. It first estimates the distribution of the MDCT coefficients within each frame from the measurement vector by means of AMP and then passes the message to the hidden variables. Then, the belief propagates within the support matrix in a specific order [9], updating the value of each hidden variable; this belief propagation exploits the correlation between neighboring elements in the coefficient matrix. At last, the message from each hidden variable is passed back to the corresponding MDCT coefficient, updating the marginal distribution of each coefficient, and AMP is executed in each frame again. The whole process is iterated until a stopping condition is met. This alternately iterative process can be referred to as turbo message passing. By this method, we not only exploit the temporal correlation of the voiced sound but also make full use of the correlation of the MDCT coefficients in the unvoiced sound frames, leading to an obvious performance improvement over traditional CS methods.
The main contributions of the paper can be highlighted as follows: (1) We exploit the structure of the speech spectrogram to improve the recovery from CS samples. Both the temporal correlation of the voiced sound and the correlation between neighboring MDCT coefficients in the unvoiced sound frames are utilized in a turbo message passing framework, and a stopping condition for the turbo iteration is proposed to guarantee a reliable recovery of the speech signal. (2) The method can be applied to speech enhancement and achieves satisfactory performance.
2. Related Work
In this section, we introduce related CS methods. Some of them can be employed to model the temporal correlation of the voiced sound, while others can model the correlation between neighboring MDCT coefficients in the unvoiced sound frames.
The harmonics of voiced sound may span a few neighboring frames and show strong temporal correlation. The problem of recovering vectors with temporal correlation from their CS samples is referred to as the multiple measurement vector (MMV) problem, and many methods have been developed to solve it. Lee et al. proposed the SA-MUSIC algorithm to estimate the indexes of the nonzero rows when the row rank of the unknown matrix is defective [10]. However, it is apt to allocate all the signal energy to low-frequency components when applied to a speech signal. Vaswani et al. proposed the modified-CS algorithm to recursively reconstruct time sequences of a sparse spatial signal [11]. They used the support of the previous time instant as the "known" part for the current time instant, so the algorithm can exploit the temporal correlation between neighboring frames. However, it runs slowly, because the current frame cannot be processed until its previous frame has been reconstructed. Ziniel et al. proposed the AMP-MMV algorithm, which assumes that the vectors in the MMV problem share a common support and that the amplitudes corresponding to the same support are temporally correlated [12]; this temporal correlation is modeled as a Gauss–Markov process. However, when applied to a speech signal, the assumption of a common support is not always true. This is in fact the dynamic CS problem, where the support of the unknown vectors changes slowly. To solve this problem, the authors proposed the DCS-AMP algorithm [13], which models the changes in the support over time as a Markov chain. This algorithm is adequate for the reconstruction of voiced sound from CS samples, but not for unvoiced sound. Apart from these methods, the temporal correlation can be exploited by another model, e.g., the linear prediction model. Ji et al. proposed to jointly leverage the speech samples and their linear prediction coefficients to learn a structured dictionary [14]. Based on this dictionary, a high reconstruction quality and fast convergence can be achieved. However, a training dataset is needed and the dictionary is ad hoc.
The spectrum of unvoiced sound usually contains smaller and denser frequency components, and the sound itself has a shorter duration; as a result, the temporal correlation among neighboring frames is weaker, and the variation range of the amplitude is smaller as well. To exploit the structure of unvoiced sound, several statistical models have been proposed. Févotte et al. [15] modeled the transient part (e.g., the attacks of notes) of a musical signal as a Markov chain along the frequency axis. They used a Gibbs sampler to learn the model parameters and thereby reduced the "musical noise". The similarity of neighboring supports in a Markov chain is called persistency in their paper, a concept we adopt here. Jiang et al. [16] modeled the persistency of the supports in each frame of a speech signal as a Markov chain. They showed that exploiting the dependence between neighboring frequency components improves the reconstruction quality. However, they ignored the temporal correlation between neighboring frames found in the spectrogram. Som et al. further modeled the relationship between neighboring elements in a two-dimensional unknown matrix as a Markov random field [17]. However, due to its simplistic assumption on the probability density function of each element, this method achieves only a slight improvement in reconstruction quality over conventional methods.
3. Speech Signal Model
In this section, we first approximate the marginal distribution of each MDCT coefficient of the speech signal as a Bernoulli–Gaussian mixture in
Section 3.1, and then model the correlation between neighboring support variables in the spectrogram as a Markov chain in
Section 3.2.
The speech signal is first transformed into a coefficient matrix X by a windowed MDCT transform; a Wilson or normal window is chosen to generate the WMDCT basis. Then each column vector x_t, which is composed of the MDCT coefficients in frame t, is multiplied by a partial Fourier matrix Φ. This is the CS procedure in the MDCT domain, which can be mathematically expressed as:

y_t = Φx_t,

where y_t is the measurement vector, each element of which is a CS sample. The ratio M/N is defined as the undersampling ratio.
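As a concrete sketch of this per-frame measurement step, the snippet below applies a random partial Fourier matrix to a synthetic sparse coefficient vector. The frame length, number of CS samples and sparsity level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 86                 # frame length and number of CS samples (M/N ≈ 1/3)

# Partial Fourier sensing matrix: M randomly chosen rows of the N-point DFT matrix
F = np.fft.fft(np.eye(N)) / np.sqrt(N)
Phi = F[rng.choice(N, size=M, replace=False), :]

# A toy sparse stand-in for the MDCT coefficient vector of frame t
x_t = np.zeros(N)
x_t[rng.choice(N, size=8, replace=False)] = rng.normal(size=8)

y_t = Phi @ x_t                # CS measurement of frame t, one CS sample per element
```

Because the rows of the DFT matrix are complex, the measurement vector is complex-valued even though the MDCT coefficients are real.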
3.1. The Marginal Distribution of the Coefficient
We approximate the marginal distribution of each nonzero MDCT coefficient by a Gaussian mixture, and represent a zero coefficient by a Dirac delta function. The probability density function of each MDCT coefficient can then be expressed as:

p(x_{n,t}) = (1 − λ) δ(x_{n,t}) + λ Σ_{l=1}^{L} ω_l N(x_{n,t}; μ_l, σ_l²),

where λ is the probability that a coefficient is nonzero, the subscript n ∈ [1, …, N] indexes frequency bins, the subscript t ∈ [1, …, T] indexes frames, and ω_l, μ_l and σ_l² denote the weight, mean and variance of the l-th Gaussian function, respectively. We suppose that the number of Gaussian functions is L. The weights satisfy the normalizing constraint Σ_{l=1}^{L} ω_l = 1. These hyper-parameters can be learned from the measurement vectors using the expectation–maximization (EM) algorithm [18].
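To make the prior concrete, the following sketch draws coefficients from a Bernoulli–Gaussian-mixture of this form. The hyper-parameter values (sparsity level, weights, means and variances) are arbitrary placeholders rather than values learned by EM:

```python
import numpy as np

# Placeholder hyper-parameters of a Bernoulli-Gaussian-mixture prior with L = 2
lam   = 0.2                    # probability that a coefficient is nonzero
omega = np.array([0.7, 0.3])   # mixture weights (sum to 1)
mu    = np.array([0.0, 0.0])   # component means
var   = np.array([1.0, 25.0])  # component variances

def sample_coefficients(n, rng):
    """Draw n coefficients x and their supports s from the mixture prior."""
    # support: 0 means a zero coefficient, l in {1, 2} selects a Gaussian component
    s = np.where(rng.random(n) < lam,
                 1 + rng.choice(len(omega), size=n, p=omega),
                 0)
    x = np.zeros(n)
    nz = s > 0
    x[nz] = rng.normal(mu[s[nz] - 1], np.sqrt(var[s[nz] - 1]))
    return x, s

x, s = sample_coefficients(10000, np.random.default_rng(2))
```

The fraction of nonzero draws concentrates around λ, and each nonzero coefficient is generated by the Gaussian component its support selects, which is exactly the two-hidden-variable (support plus amplitude) structure used in the model.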
3.2. The Relationship between Neighboring Coefficients
We assume that each nonzero coefficient x is drawn from one of the L Gaussian distributions. This distribution can be indexed by a nonnegative integer s = l (l ∈ [1, …, L]). When x is zero, s is set to 0. So s indicates whether a coefficient is nonzero and can play the role of the support variable. Depending on the support, x can be expressed as:

x = θ_l if s = l (l ∈ [1, …, L]),  and  x = 0 if s = 0,

where θ_l is the amplitude drawn from the l-th Gaussian function. So, each MDCT coefficient is determined by its support variable and its amplitude variable.
The persistency of the MDCT coefficients means that neighboring coefficients may share the same support variable. This persistency can be modeled as a first-order Markov chain. Specifically, for a frequency index n, the vector [s_{n,1}, s_{n,2}, …, s_{n,T}] across all frames forms a Markov chain. A Markov chain can describe the slow changes in neighboring supports through its transition probability, e.g., P(s_{n,t} | s_{n,t−1}). This transition probability and the steady-state distribution P(s_{n,t}) fully describe the Markov chain, and these hyper-parameters can likewise be learned using the EM algorithm.
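A hypothetical numerical sketch of such a support chain is given below. With L = 2 Gaussian components the state space is {0, 1, 2}, and the transition matrix here is an arbitrary placeholder whose heavy diagonal encodes the persistency of the supports:

```python
import numpy as np

# Hypothetical transition matrix P[i, j] = P(s_t = j | s_{t-1} = i) over the
# support states {0, 1, 2} (zero coefficient plus L = 2 Gaussian components)
P = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.85, 0.05],
              [0.10, 0.05, 0.85]])

def sample_support_chain(T, P, rng):
    """Simulate the support sequence [s_1, ..., s_T] at one frequency bin."""
    # steady-state distribution: the left eigenvector of P for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    pi = np.abs(np.real(v[:, np.argmin(np.abs(w - 1.0))]))
    pi /= pi.sum()
    s = np.empty(T, dtype=int)
    s[0] = rng.choice(len(pi), p=pi)
    for t in range(1, T):
        s[t] = rng.choice(P.shape[1], p=P[s[t - 1]])
    return s

s = sample_support_chain(200, P, np.random.default_rng(3))
```

Because each diagonal entry of P is large, the simulated support tends to stay in the same state for many consecutive frames, which is the slow variation the first-order Markov chain is meant to capture.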
5. Results and Discussion
In this section, we reconstruct speech signals from CS samples using the proposed algorithm and compare it with other algorithms, including DCS-AMP [13], EM-GM-AMP [18], MRF [17] and FISTA [21]. DCS-AMP models the marginal distribution of each MDCT coefficient as a Bernoulli–Gaussian mixture and the persistency of the supports along the time axis as a Markov chain; it aims to recover sparse, slowly time-varying vectors from CS samples. EM-GM-AMP models the marginal distribution of each nonzero coefficient as a Gaussian mixture and learns the hyper-parameters using the EM algorithm; it does not exploit the signal structure information. The MRF method models the marginal distribution of each coefficient as a Bernoulli–Gaussian and the structure of the support matrix as a Markov random field. FISTA is a modification of ISTA [4]; it makes a compromise between the squared error and signal sparsity and does not exploit the signal structure information either.
In all experiments, the speech signal is partitioned into frames using the same window function, and neighboring frames have a 50% overlap. The sensing matrix is a time-invariant partial Fourier matrix.
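The framing step can be sketched as follows; the frame length and the MDCT-style sine window here are illustrative choices, not necessarily those used in the experiments:

```python
import numpy as np

def frame_signal(x, frame_len, window):
    """Partition x into 50%-overlapping windowed frames (hop = frame_len // 2)."""
    hop = frame_len // 2
    n_frames = 1 + (len(x) - frame_len) // hop
    # one column per frame, matching the coefficient-matrix layout X
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)], axis=1)

x = np.random.default_rng(5).normal(size=16000)    # 1 s of audio at 16 kHz
frame_len = 256
window = np.sin(np.pi * (np.arange(frame_len) + 0.5) / frame_len)  # sine window
X = frame_signal(x, frame_len, window)             # shape: (frame_len, n_frames)
```

With a 50% hop, every sample (away from the signal edges) is covered by exactly two windowed frames, which is the standard prerequisite for perfect reconstruction in lapped transforms such as the MDCT.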
First, we evaluate the reconstruction performance by a quasi-SNR (signal-to-noise ratio):

SNR = 20 log₁₀(‖x‖ / ‖x − x̂‖),

where x and x̂ are the original speech and the reconstructed speech, respectively. We reconstruct three speech excerpts, including an adult male's voice, an adult female's voice and a child's voice, from their CS samples. Each speech excerpt was recorded with a cellphone in a quiet environment, and the duration of each excerpt is 25 s.
Figure 6 shows the SNR under different undersampling ratios for the three speech excerpts. Each result is an average of 50 Monte Carlo experiments.
Obviously, as the undersampling ratio increases, the SNR increases too. In general, the reconstruction performance improves as the signal model becomes more accurate. While DCS-AMP and EM-GM-AMP offer better recovery performance than FISTA and MRF, the proposed algorithm achieves the best performance under nearly all undersampling ratios. However, when the undersampling ratio drops to 0.2, the proposed algorithm is exceeded by DCS-AMP. This may be explained by the fact that when the measurement information about a broadband signal is severely lacking, the estimate from the CS samples using AMP is not accurate enough, so the belief propagation based on this estimate does not improve the reconstruction performance. The SNR gap between the proposed algorithm and DCS-AMP is larger for the adult female's voice than for the adult male's voice. This may be explained by the fact that the female's pitch is higher than the male's, so more signal energy is located in higher frequency components; this energy distribution can be captured by the belief propagation along the frequency axis in the spectrogram. When more measurements are provided, EM-GM-AMP exceeds DCS-AMP for the adult female's voice at undersampling ratios of 0.4 or larger. This can be explained by the fact that an adult female's voice changes quickly, so the temporal correlation becomes smaller. For the child's voice, the difference in SNR among the three algorithms (the proposed, DCS-AMP and EM-GM-AMP) is small, especially when the undersampling ratio is large. This may be explained by the child being too young and his pronunciation not yet fluent, so the Markov chain that models the temporal correlation in the proposed method and DCS-AMP does not work well. Despite this, the two methods still achieve a higher SNR than EM-GM-AMP. The MRF and FISTA algorithms achieve a lower SNR than the aforementioned methods. The MRF method models the marginal distribution of each coefficient as a Bernoulli–Gaussian, which is compatible with the Ising model assumed on the support matrix; however, this marginal distribution is too simplistic, so its SNR is lower than that of the other methods. The FISTA method can be adjusted by a regularization parameter that trades the squared error against sparsity: the larger the parameter, the sparser the solution. In the experiments, it is set to 0.01. Since this method does not exploit the signal structure, the SNR gap between the first three methods and FISTA becomes larger as the undersampling ratio increases.
Second, we compare the spectrograms of the speeches reconstructed by the different algorithms.
Figure 7 shows the spectrogram of a speech excerpt, together with the counterparts of the reconstructed speech excerpts. The undersampling ratio is set to 1/3. It is obvious that the signal is not sparse enough; its energy is mainly located in the low-frequency components. FISTA and EM-GM-AMP do not exploit the structure within the coefficient matrix and allocate too much energy to the high-frequency components. MRF is better than these two algorithms in this respect; in fact, the block-sparse structure in the support matrix is captured by the Markov random field, and the nonzero coefficients of the recovered speech appear in clusters. However, this method discards the sporadic coefficients and may distort subjective perception. The spectrograms obtained by DCS-AMP and the proposed algorithm are more similar to the original spectrogram than those of the other algorithms, although some differences exist: the detail of the structure is demonstrated more clearly in the spectrogram of the speech reconstructed by the proposed algorithm. This means that the proposed algorithm can reduce the additive white Gaussian noise that exists in the original speech signal, an effect that may be confirmed by the following experiment.
Next, we give the perceptual evaluation of speech quality (PESQ) [22] scores for the different algorithms in Table 1. The PESQ score is used to evaluate the difference between a degraded speech and the original speech and to estimate the subjective mean opinion score; the higher the score, the better the quality. Here, the degraded speech is the speech recovered from the CS samples. In the experiment, the original adult male's speech and adult female's speech are chosen from the TIMIT acoustic–phonetic continuous speech corpus, with a sampling frequency of 16 kHz; the original child's speech is downsampled to 16 kHz as well. The duration of each speech signal is 25 s, and the undersampling ratio is set to 1/3. Each result is the mean of 50 Monte Carlo experiments. As the table shows, the proposed algorithm achieves the highest PESQ score. There is little difference between the scores obtained by DCS-AMP and EM-GM-AMP; in general, they achieve suboptimal results. The FISTA method achieves a better score than the MRF method. This may be explained by the fact that the latter discards some coefficients far away from the clusters described by the Markov random field. It is worth mentioning that the FISTA method achieves a score similar to those of the EM-GM-AMP and DCS-AMP methods when applied to the adult male's voice. We speculate that the adult male has a lower pitch, so most of the signal energy is located in the voiced sound, which can be reconstructed well using conventional CS methods.
Finally, from the results in Table 1, we can conclude that the proposed algorithm has a speech enhancement effect. So far, we have reconstructed clean speech signals from CS samples; we then compressively sampled noisy speech and reconstructed it using the proposed method. The noisy speech signals were randomly selected from the NOIZEUS database, including five excerpts of a male voice indexed by 01–05 and five excerpts of a female voice indexed by 11–15. The noisy speech signals contain babble noise at different SNRs. We computed the PESQ score and the STOI (short-time objective intelligibility) score [23] using the recovered speech and the clean speech, which was not corrupted by any noise. Each result is the mean of 50 Monte Carlo experiments; the higher the score, the better the enhancement effect. We make a comparison with a state-of-the-art speech enhancement algorithm [24], which we refer to as the reference method. The undersampling ratio is set to 0.9 for both methods. The experimental results are shown in Figure 8 and Figure 9; the scores of the reference method are taken from the published paper. It can be seen from Figure 8 that the proposed method achieves a PESQ score comparable to the reference method for the male voice, and as the SNR of the noisy speech increases, the PESQ score of the recovered female voice gradually approaches that of the reference method. It can be seen from Figure 9 that as the SNR of the noisy speech increases, the proposed method achieves a higher STOI score than the reference method for both the male and the female voice. Overall, the experimental results show that our method achieves PESQ and STOI scores comparable to the state-of-the-art speech enhancement algorithm.