Speech Enhancement Using U-Net with Compressed Sensing

Zheng Kang, Zhihua Huang and Chenhua Lu
1 School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
2 Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(9), 4161; https://doi.org/10.3390/app12094161
Submission received: 3 March 2022 / Revised: 12 April 2022 / Accepted: 14 April 2022 / Published: 20 April 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

With the development of deep learning, speech enhancement based on deep neural networks has made great breakthroughs. Methods based on the U-Net structure have achieved good denoising performance. However, some of them rely on ordinary convolution operations, which may ignore the contextual information and detailed features of the input speech. To address this issue, many studies improve model performance by adding extra network modules, such as attention mechanisms or long short-term memory (LSTM). In this work, we therefore propose a time-domain U-Net speech enhancement model that combines a lightweight Shuffle Attention mechanism with a compressed sensing loss (CS loss). Time-domain dilated residual blocks are constructed and used for down-sampling and up-sampling in this model. Shuffle Attention is applied to the final output of the encoder to focus on speech features and suppress irrelevant audio information. A new loss is defined using the compressed-sensing measurements of the clean and enhanced speech; it further removes noise from the noisy speech. In the experiments, ablation studies examine the influence of different loss functions on model performance and verify the effectiveness of the CS loss. Compared with the reference models, the proposed model obtains higher speech quality and intelligibility scores with fewer parameters. When dealing with noise outside the dataset, the proposed model still achieves good denoising performance, which shows that it not only achieves a good enhancement effect but also has good generalization ability.

1. Introduction

Speech enhancement is very important in the field of speech processing. Its goal is to improve the quality and intelligibility of speech that is disturbed by external noise. Speech enhancement is widely used in applications such as speech recognition, mobile communications, and hearing aids. Traditional speech enhancement algorithms include spectral subtraction [1,2], Wiener filtering [1,3], and subspace methods [1].
Recently, with the development of deep learning, speech enhancement has made major breakthroughs. Deep learning-based methods include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and others. Among them, speech enhancement based on the convolutional U-Net has attracted much attention. Pascual et al. applied a Generative Adversarial Network (GAN) to speech enhancement (SEGAN) [4], using a U-Net-based generator that works end-to-end in the time domain. However, the denoising performance of SEGAN was mediocre. In [5], SEGAN+ was proposed, which added learnable factors to the skip connections of the SEGAN generator to improve denoising performance. Iterative SEGAN (ISEGAN) and Deep SEGAN (DSEGAN) were proposed, using multiple U-Net-based generators to extract the same and different speech features, respectively [6]. However, multiple generators increase model complexity and the number of parameters. In [7], a self-attention mechanism was proposed and coupled with the (de)convolutional layers of the generator (SASEGAN), attending to the temporal context of the speech. In [8], a new loss was defined using speech feature maps obtained from the discriminator. The U-Net-based generator in [9] adopts Sinc convolution to focus on more low-level speech characteristics (Sinc-SEGAN). Furthermore, the U-Net has been used to enhance noisy speech in the time-frequency domain without adversarial training [10,11]. These methods achieved a better enhancement effect, but the original noisy phase spectrum may affect the denoising results. The Wave-U-Net [12] has also been used to enhance speech in the time domain without adversarial training [13]. In addition, [14], which is based on the structure of [13], added a local self-attention mechanism to each skip connection of the U-Net. Ref. [15] proposed causal and non-causal U-Net models based on convolution and LSTM for enhancement (DEMUCS) and achieved comparable performance. The above studies show that U-Net has clear advantages in speech enhancement and deserves further exploration.
In addition, compressed sensing (CS) [16] provides a new research direction for speech enhancement [17,18,19,20]. CS imposes several constraints, among which the sparsity constraint on the signal is particularly important; however, general signals are only approximately sparse, which may degrade the reconstruction quality. Therefore, it is difficult to further improve the performance of CS. To address these problems, Kabkab et al. [21] pointed out that sparsity is the most common structural assumption on the signal, but not the only one. Accordingly, much research has proposed deep learning-based CS [21,22,23,24] and demonstrated the superiority of these methods with respect to signal sparsity and reconstruction. However, the application of deep learning-based CS to speech enhancement tasks still needs further exploration.
On the one hand, the U-Net can achieve good performance in speech enhancement. On the other hand, deep learning-based CS can effectively improve the quality of signal reconstruction by optimizing the error between different measurements. Recently, Ref. [25] proposed a lightweight attention mechanism that helps improve the performance of deep convolutional neural networks. We therefore propose a time-domain U-Net speech enhancement model combining the lightweight Shuffle Attention mechanism with a CS loss. We use a U-Net structure with 5 encoding layers and 5 decoding layers, as in [15]. In the proposed model, time-domain dilated residual blocks are constructed for down-sampling and up-sampling; they focus on speech contextual information and detailed features during convolution and are similar to the residual units applied in the time-frequency domain in [11]. Shuffle Attention is used to suppress irrelevant information after down-sampling and improve the enhancement effect. In addition, in the training stage, inspired by [22], we define a CS loss function using the measurements of speech, which can further remove noise from noisy speech.
In the experiments, we analyze the effectiveness of the CS loss and compare the influence of different loss functions on the enhancement results through ablation experiments. The results show that the CS loss helps improve enhancement performance. Furthermore, the proposed model achieves higher speech quality and intelligibility with fewer parameters than existing U-Net-based models. Our model has only 32.99% of the parameters of the latest Sinc-SEGAN [9]. Compared with the classical Attention-Wave-U-Net [14], the proposed model achieves relative improvements of 11.07%, 9.97%, 4.18%, and 10.70% on PESQ, CSIG, CBAK, and COVL, respectively. Furthermore, compared with the causal DEMUCS [15], our model obtains higher SSNR, SAR, and SDR scores, so its speech distortion is lower. When removing noise types that differ from those in the open dataset, the proposed model still achieves a good enhancement effect, indicating that it has good generalization performance.

2. U-Net and Compressed Sensing

2.1. Speech Enhancement Based on U-Net

In recent years, speech enhancement based on deep learning has made great progress, and U-Net is a very classical model structure in this field [11,12,13,14,15]. The essential components of this framework are the encoder, the decoder, and the skip connections. A brief overview of the model structure is shown in Figure 1. The encoder down-samples the input speech to obtain higher-level speech features. The decoder is the reverse process of the encoder: it up-samples the speech feature maps and finally outputs enhanced speech with the same length as the original input. The skip connections combine each encoding layer with its corresponding decoding layer, helping pass the features of each encoding layer to the decoder. The transitional operations in Figure 1 can adopt modules such as convolution, LSTM, or attention mechanisms, which further attend to the feature information output by the encoder.
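To make this data flow concrete, the minimal PyTorch sketch below shows how encoder outputs are carried over skip connections into the decoder. The layer sizes, paddings, and the 1 × 1 convolution standing in for the transitional module are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal time-domain U-Net: encoder, transitional module, decoder, skip connections."""
    def __init__(self, channels=(1, 16, 32, 64)):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Conv1d(channels[i], channels[i + 1], kernel_size=8, stride=4, padding=2)
            for i in range(len(channels) - 1))
        self.mid = nn.Conv1d(channels[-1], channels[-1], kernel_size=1)  # stand-in transitional op
        self.dec = nn.ModuleList(
            nn.ConvTranspose1d(channels[i + 1], channels[i], kernel_size=8, stride=4, padding=2)
            for i in reversed(range(len(channels) - 1)))

    def forward(self, x):                       # x: (batch, 1, samples), samples divisible by 4**3
        skips = []
        for layer in self.enc:                  # down-sampling path
            x = torch.relu(layer(x))
            skips.append(x)
        x = torch.relu(self.mid(x))             # transitional operation (conv / LSTM / attention)
        for i, layer in enumerate(self.dec):    # up-sampling path
            x = layer(x + skips.pop())          # skip connection: add the matching encoder output
            if i < len(self.dec) - 1:
                x = torch.relu(x)
        return x                                # enhanced waveform, same length as the input

wav = torch.randn(2, 1, 16384)                  # 16384 = 4**3 * 256, so lengths line up exactly
print(TinyUNet()(wav).shape)                    # torch.Size([2, 1, 16384])
```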
Some U-Net-based speech enhancement models rely on ordinary convolution operations in the codec [4,5,6,8,15], so the contextual information of the speech may be ignored and detailed features may be lost. To tackle these problems, a self-attention mechanism was added to the U-Net-based generator to attend to non-local detail features of speech [7]. Although the enhancement performance of that model improved, the attention structure was more complicated, which increased the model complexity. Macartney et al. [13] proposed Wave-U-Net for speech enhancement, based on the U-Net structure for audio source separation that uses linear interpolation [12]. Their experiments demonstrated the effectiveness of Wave-U-Net. However, while high-level speech features were extracted, detailed features were still ignored during down-sampling and up-sampling. In [14], local self-attention was added to the skip connections of the U-Net to focus on the detailed features of each encoding layer (Attention-Wave-U-Net). It consists of 12 encoding layers and 12 decoding layers. It achieved a high segmental SNR score, but the overall enhancement quality requires further study. In [15], the number of codec layers was reduced and a variety of data augmentation methods were used to achieve a good denoising effect, which also inspired this paper. However, the codec in [15] relies on ordinary convolution. In this paper, we aim to further improve the speech enhancement performance of U-Net while reducing the number of parameters as much as possible. Therefore, we use 5 encoding layers and 5 decoding layers, as in [15]. Time-domain dilated residual blocks are constructed for down-sampling and up-sampling to focus on detailed features and speech context. A lightweight Shuffle Attention [25] is applied to the final output of the encoder; it focuses on speech feature information and suppresses residual noise after down-sampling. The specific structure of our model is introduced in detail in Section 3.1 and Section 3.2.

2.2. Principle of CS

Compressed sensing (CS) is a compressed sampling method. It samples data at rates far below the Nyquist sampling rate [16]. CS requires the signal to be sparse, and the measurements of the signal are obtained through sparse representation and dimensionality reduction of the original signal. Measurements can retain most of the information in the original signal with a very small amount of data. Equation (1) expresses the process of obtaining the measurement $y \in \mathbb{R}^M$:
$y = A x$  (1)
where $x \in \mathbb{R}^N$ is the original signal and $A \in \mathbb{R}^{M \times N}$ ($M \ll N$) is the measurement matrix. The core problem of CS is to reconstruct the original signal $x$ from the measurements, which requires constraints on $x$. The most common constraint is that $x$ is sparse [16], but this is not the only possible assumption [21,22]. The original signal $x$ can be recovered from the measurements by solving an optimization problem with a reconstruction algorithm [26,27] when the measurement matrix $A$ satisfies the Restricted Isometry Property (RIP). However, general signals are only approximately sparse, which may degrade the quality of the reconstructed signal. Therefore, much research [21,22,23] has used neural networks to constrain the signal structure instead of the sparsity constraint, and the measurements of the signal were used to define losses. These algorithms effectively solve the problems caused by approximate sparsity in CS. At the same time, they prove that measurements can be used as an optimizable target to train a model and improve the quality of signal reconstruction. Based on the method of CS with generative models in [22], and without relying on a sparsity constraint, we define a CS loss function using the measurement matrix, the clean speech, and the enhanced speech generated by the proposed U-Net model (detailed in Section 3.3). The aim is to further improve speech enhancement performance.
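As a small illustration of Equation (1), the NumPy sketch below draws a zero-mean random Gaussian measurement matrix (the choice later used in Section 4.2) and compresses a length-N signal into M << N measurements; the signal here is random data standing in for a speech frame, and the scaling of the matrix is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 16000, 4096                             # signal length and measurement number, M << N
x = rng.standard_normal(N)                     # stand-in for 1 s of speech sampled at 16 kHz
A = rng.standard_normal((M, N)) / np.sqrt(M)   # zero-mean i.i.d. Gaussian measurement matrix
y = A @ x                                      # Eq. (1): y = A x, the compressed measurements
print(x.shape, "->", y.shape)                  # (16000,) -> (4096,)
```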

3. Proposed Model

3.1. Model Architecture

We build a speech enhancement model based on the U-Net structure, as shown in Figure 2. Our model directly enhances noisy speech in the time domain.
In this work, to attend to the speech contextual information and prevent detailed features from being lost during down-sampling or up-sampling, we construct time-domain dilated residual blocks (the Res Block in Figure 2) for down-sampling and up-sampling in the encoder and decoder. During down-sampling or up-sampling, the input speech feature map is first fed into a dilated convolution with kernel size 8, stride 1, and dilation factor 2, followed by batch normalization (BN) and a rectified linear unit (ReLU). The number of filters of the dilated convolution is equal to Ci or Hi. For example, as shown in Figure 2, in Res_Block_2 of the encoder the input channel of the dilated convolution is 64 and the output channel (i.e., the number of filters) is 128, while in Res_Block_4 of the decoder the input channel is 128 and the output channel is 64. Since the gated linear unit (GLU) [28] splits the input tensor along a specified dimension, we double the channels of the speech feature map with a one-dimensional convolution (1D-Conv) whose kernel size is 1, stride is 1, and number of filters is 2 × Ci or 2 × Hi. As shown in Figure 2, in Res_Block_2 of the encoder the input channel of the left 1D-Conv is 128 and the number of filters is 256; in Res_Block_4 of the decoder the input channel of the right 1D-Conv is 64 and the number of filters is 128. The output of the 1D-Conv is then activated by a GLU along the channel axis, followed by BN. The dilated convolution expands the receptive field and aggregates context for speech enhancement, while BN prevents vanishing or exploding gradients. Moreover, the channels of the original input are adjusted by a 1D-Conv with kernel size 1, stride 1, and Ci or Hi filters, followed by BN. This 1D-Conv ensures that the outputs of the two branches have equal dimensions so that they can be added (denoted by a circled plus sign in Figure 2). Finally, the speech feature map is fed to a convolution (Conv) or transposed convolution (Trans-Conv) with kernel size 8, stride 4, and Ci or Hi filters, and then activated by ReLU to obtain the output of the encoding or decoding layer. For instance, in Figure 2, in the second down-sampling the numbers of input channels and filters in the Conv are both 128, and in the fourth up-sampling the numbers of input channels and filters in the Trans-Conv are both 64. Only one down-sampling block and one up-sampling block are shown in Figure 2; apart from the numbers of channels (i.e., Ci and Hi), the other layers use identical sampling operations.
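The following PyTorch sketch renders the residual block described above in code. The kernel sizes, strides, dilation, and channel counts follow the text; the padding values and the exact placement of BN are assumptions, since the paper does not state them.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Sketch of the time-domain dilated residual block (Res Block in Figure 2)."""
    def __init__(self, in_ch, out_ch, transposed=False):
        super().__init__()
        # dilated convolution (k=8, stride=1, dilation=2) -> BN -> ReLU; padding keeps the length
        self.dilated = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=8, dilation=2, padding=7),
            nn.BatchNorm1d(out_ch), nn.ReLU())
        # 1D-Conv doubles the channels so that the GLU (which halves them) returns out_ch
        self.glu_path = nn.Sequential(
            nn.Conv1d(out_ch, 2 * out_ch, kernel_size=1),
            nn.GLU(dim=1), nn.BatchNorm1d(out_ch))
        # shortcut 1D-Conv adjusts the input channels so the two branches can be added
        self.shortcut = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=1), nn.BatchNorm1d(out_ch))
        # final (transposed) convolution, k=8, stride=4, performs the actual resampling
        conv = nn.ConvTranspose1d if transposed else nn.Conv1d
        self.resample = conv(out_ch, out_ch, kernel_size=8, stride=4, padding=2)

    def forward(self, x):                       # x: (batch, in_ch, time)
        h = self.glu_path(self.dilated(x))
        return torch.relu(self.resample(h + self.shortcut(x)))
```

Under these assumptions, ResBlock(64, 128) would play the role of Res_Block_2 in the encoder, and ResBlock(128, 64, transposed=True) that of Res_Block_4 in the decoder.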

3.2. Shuffle Attention Mechanism

The attention mechanism can focus on speech features and suppress irrelevant information. We apply the lightweight Shuffle Attention [25] to the final output of the encoder. It is used to focus on the detailed features of speech and suppress irrelevant information after down-sampling. The structure is shown in Figure 3.
This attention mechanism divides the speech feature map into G groups along the channel dimension. The speech feature map of each group is then evenly divided into two sub-feature maps along the channel dimension, which serve as the inputs to Channel Attention (C-Att) and Spatial Attention (S-Att), respectively. C-Att focuses on the correlation between the channels of a speech sub-feature map, and S-Att focuses on its detailed features. In Figure 3, $x_{k1}$ and $x_{k2}$ are the speech sub-feature maps obtained after the k-th group of speech features ($x_k$) is split along the channels. $x_{k1}$ and $x_{k2}$ are fed to C-Att and S-Att, respectively, as shown in Equations (2) and (3):
$\hat{x}_{k1} = \sigma(\omega_c f_{GAP}(x_{k1}) + b_c)\, x_{k1}$  (2)
$\hat{x}_{k2} = \sigma(\omega_s g(x_{k2}) + b_s)\, x_{k2}$  (3)
where $\hat{x}_{k1}$ and $\hat{x}_{k2}$ are the sub-feature maps output by C-Att and S-Att, respectively. They are concatenated along the channel axis to obtain the attention output of the k-th group of speech features (i.e., $\hat{x}_k$). $\omega_c$, $\omega_s$, $b_c$, and $b_s$ are the weights and biases of the FC operations in C-Att and S-Att, respectively; $f_{GAP}(\cdot)$ is global average pooling (GAP), $g(\cdot)$ is group normalization, and $\sigma(\cdot)$ is the sigmoid function. Finally, the attention outputs of all groups are aggregated, and the final output is obtained after a channel shuffle operation. The final output has the same size as the original input. The channel shuffle operation enables information fusion between different speech sub-features.
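A simplified PyTorch sketch of this module for 1-D speech feature maps is given below. The per-channel scale-and-bias form of the "FC" in Equations (2) and (3) and the group-norm configuration are assumptions based on the description above and on [25].

```python
import torch
import torch.nn as nn

class ShuffleAttention1D(nn.Module):
    """Simplified Shuffle Attention for 1-D speech feature maps (cf. Figure 3 and [25])."""
    def __init__(self, channels, groups=64):
        super().__init__()
        self.g = groups
        c = channels // (2 * groups)                  # channels of each sub-feature map
        self.wc = nn.Parameter(torch.zeros(1, c, 1))  # C-Att "FC": per-channel weight and bias
        self.bc = nn.Parameter(torch.ones(1, c, 1))
        self.ws = nn.Parameter(torch.zeros(1, c, 1))  # S-Att "FC": per-channel weight and bias
        self.bs = nn.Parameter(torch.ones(1, c, 1))
        self.gn = nn.GroupNorm(c, c)                  # group norm g(.) used by S-Att

    def forward(self, x):                             # x: (batch, channels, time)
        b, c, t = x.shape
        x = x.view(b * self.g, c // self.g, t)        # split into G groups along the channels
        x1, x2 = x.chunk(2, dim=1)                    # sub-feature maps for C-Att and S-Att
        x1 = torch.sigmoid(self.wc * x1.mean(dim=2, keepdim=True) + self.bc) * x1  # Eq. (2)
        x2 = torch.sigmoid(self.ws * self.gn(x2) + self.bs) * x2                   # Eq. (3)
        out = torch.cat([x1, x2], dim=1).view(b, c, t)
        # channel shuffle: interleave the groups so information flows between sub-features
        return out.view(b, self.g, c // self.g, t).transpose(1, 2).reshape(b, c, t)
```

With the encoder output in Figure 2 (1024 channels) and G = 64, ShuffleAttention1D(1024, groups=64) keeps the output the same size as its input, as stated above.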

3.3. Loss Function

We denote the clean speech by $s$ and the noise by $n$, so the noisy speech signal is $x = s + n$. The enhanced speech is denoted $\hat{x}$.
Most U-Net-based enhancement models use the mean squared error (MSE) or mean absolute error (MAE) between enhanced and clean speech as the loss [4,13,14,15]. However, MSE is not robust to outliers in the signal. Previous research has shown that MAE can significantly improve denoising performance [14,15]. Therefore, we adopt the MAE between enhanced and clean speech as a loss, as shown in Equation (4), where N is the number of speech samples and $\|\cdot\|_1$ is the L1 norm.
$L_{MAE}(s, \hat{x}) = \frac{1}{N}\|\hat{x} - s\|_1 = \frac{1}{N}\sum_{i=1}^{N}|\hat{x}_i - s_i|$  (4)
To further remove noise from the contaminated speech signal, we define a new loss, called the CS loss, as shown in Equation (5), using the measurements as the optimization target; here N denotes the number of measurement samples and $\|\cdot\|_1$ is the L1 norm. The measurements of the enhanced and clean speech are obtained directly by multiplying them with the same measurement matrix $A$, without relying on a sparsity constraint [22]. Measurements retain the main information, and even some less obvious features, of the original speech. Therefore, the enhancement performance of the model can be improved by optimizing the MAE between the measurements of the clean speech and those of the enhanced speech.
$L_{CS}(s, \hat{x}) = \frac{1}{N}\|A\hat{x} - As\|_1$  (5)
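A minimal sketch of Equations (4) and (5) for batched, fixed-length waveforms is shown below; the scaling of the Gaussian measurement matrix is an illustrative choice, not specified by the paper.

```python
import torch

def mae_loss(clean, enhanced):
    """Eq. (4): mean absolute error between clean and enhanced waveforms."""
    return torch.mean(torch.abs(enhanced - clean))

def cs_loss(clean, enhanced, A):
    """Eq. (5): MAE between the CS measurements A*x_hat and A*s of the two waveforms."""
    return torch.mean(torch.abs(enhanced @ A.T - clean @ A.T))

A = torch.randn(4096, 16000) / 4096 ** 0.5      # (M, N) Gaussian measurement matrix, M << N
s = torch.randn(8, 16000)                       # stand-ins for a batch of clean utterances
s_hat = s + 0.1 * torch.randn(8, 16000)         # and the corresponding enhanced outputs
print(mae_loss(s, s_hat).item(), cs_loss(s, s_hat, A).item())
```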
Furthermore, we also optimize the error between clean and enhanced speech in the time-frequency domain, which helps our model learn the time-frequency characteristics of speech. The time-frequency domain loss is a multi-resolution short-time Fourier transform (STFT) loss, i.e., the sum of STFT losses computed under different transform parameters [15,29]. The STFT loss is defined as follows:
$L_{STFT}(s, \hat{x}) = \mathbb{E}_{s \sim p(s),\, \hat{x} \sim p(\hat{x})}\left[\frac{\| |STFT(s)| - |STFT(\hat{x})| \|_F}{\| |STFT(s)| \|_F} + \frac{1}{N}\| \log|STFT(s)| - \log|STFT(\hat{x})| \|_1\right]$  (6)
where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_1$ the L1 norm, $|STFT(\cdot)|$ denotes the STFT magnitude of a signal, and N is the number of elements in the magnitude. In Equation (6), the first term is the spectral convergence loss and the second is the log STFT magnitude loss [29]. The multi-resolution STFT loss can be expressed as in Equation (7), where M is the number of STFT losses.
$L_{MSTFT}(s, \hat{x}) = \frac{1}{M}\sum_{m=1}^{M} L_{STFT}^{(m)}(s, \hat{x})$  (7)
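The multi-resolution STFT loss can be sketched in PyTorch as follows, using the FFT sizes, window lengths, and frame shifts listed in Section 4.2; the small clamp before the logarithm is an added numerical safeguard, not part of the paper's definition.

```python
import torch

def stft_mag(x, fft_size, hop, win_len):
    window = torch.hann_window(win_len, device=x.device)
    spec = torch.stft(x, fft_size, hop, win_len, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)          # avoid log(0)

def stft_loss(clean, enhanced, fft_size, hop, win_len):
    """Eq. (6): spectral convergence loss + log STFT magnitude loss for one resolution."""
    S = stft_mag(clean, fft_size, hop, win_len)
    S_hat = stft_mag(enhanced, fft_size, hop, win_len)
    sc = torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")
    log_mag = torch.mean(torch.abs(torch.log(S) - torch.log(S_hat)))
    return sc + log_mag

def multi_res_stft_loss(clean, enhanced):
    """Eq. (7): average of M = 3 STFT losses with (FFT size, frame shift, window size)."""
    configs = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]
    return sum(stft_loss(clean, enhanced, f, h, w) for f, h, w in configs) / len(configs)
```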
The influence of the loss functions on the enhancement results is compared through ablation experiments in the experimental part.

4. Experiment Setup

4.1. Dataset

In the experiments, we used the public database presented by Valentini et al. [30] (Valentini for short). This database consists of 30 speakers from the Voice Bank corpus [31], of which 28 speakers were used for the training set and 2 for the test set. The training set contains 11,572 clean-noisy utterance pairs. The noisy conditions include 10 noise types (8 noises from the DEMAND database [32] and 2 artificial noises) at signal-to-noise ratios (SNRs) of 0, 5, 10, and 15 dB. There are 824 clean-noisy utterance pairs in the test set, including 5 noise types from the DEMAND database [32] that differ from those in the training set, with SNRs of 2.5, 7.5, 12.5, and 17.5 dB.
Furthermore, to evaluate the denoising performance of our model under noise unseen in the Valentini dataset, we chose the clean speech of the 2 speakers from the clean Valentini test set [30] and noise collected by special equipment from the vibration of objects irradiated by that equipment. These objects include a laptop bag, a briefcase, a foam board, desk calendars, a potted green dill plant, clothes, and pillows. The SNR is set to −5, 0, 5, and 10 dB, and a total of 560 noisy sentences (560-Testset for short) were obtained for testing.

4.2. Experimental Parameters

The proposed model was trained for 150 epochs using the Adam optimizer [33] with a learning rate of 3 × 10−4, β1 = 0.9, and β2 = 0.999. All training and test utterances are sampled at 16 kHz. We trained the model with a batch size of 12. The channels per layer of the encoder (i.e., Ci in Figure 2) are 64, 128, 256, 512, and 1024, and the channels per layer of the decoder (i.e., Hi in Figure 2) are 512, 256, 128, 64, and 1. The preprocessing of the speech before feeding it into the model is similar to the method in [15]. Firstly, we set the length of the input single-channel speech chunks to 1 s (i.e., 16,000 samples) with an overlap of 0.5 s (i.e., 50% of the input). Secondly, since previous work has shown that data augmentation of the input speech improves the denoising ability of a model [9,14,15,34], we apply a random temporal shift to the input speech; with up to 8000 shift samples, the length of the input speech after the shift is about 0.5 s (i.e., 8000 samples). Finally, the input speech is normalized by its standard deviation before being fed into the model, and the model output is scaled back by the same standard deviation. In addition, interpolated sampling has been shown to improve denoising performance [14,15], so a sinc interpolation filter is used to resample the input and output of the model [15,35] during training. In the Shuffle Attention, the speech feature maps are divided into 64 groups (i.e., G is 64) along the channels. Measurements obtained with a random Gaussian matrix can reconstruct the original signal with a higher probability, and such matrices are universally applicable in traditional CS; we therefore chose a zero-mean random Gaussian matrix with independent and identically distributed entries as the measurement matrix A when defining the CS loss. The number of rows of A is the measurement number, and the number of columns equals the length of the preprocessed input speech. The multi-resolution STFT loss is applied during training (M is set to 3), with a Hanning window, FFT sizes of {512, 1024, 2048}, window sizes of {240, 600, 1200}, and frame shifts of {50, 120, 240} [29]. In the test phase, the test set is enhanced directly without any preprocessing.
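The chunking and normalization steps described above can be sketched as follows (the random shift and sinc resampling are omitted); `model` stands for the trained enhancement network.

```python
import torch

def chunk_with_overlap(wav, chunk=16000, hop=8000):
    """Split a 16 kHz waveform into 1 s chunks with 0.5 s (50%) overlap."""
    return wav.unfold(dimension=-1, size=chunk, step=hop)     # (..., n_chunks, chunk)

def enhance_normalized(model, noisy, eps=1e-8):
    """Normalize by the input's standard deviation, enhance, then scale back."""
    std = noisy.std(dim=-1, keepdim=True) + eps
    return model(noisy / std) * std
```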

4.3. Competing Methods

To evaluate the effectiveness of our model, the baseline models selected for comparison are all trained on the Valentini dataset. The reference methods are as follows:
  • Wiener filtering [3]: A traditional speech enhancement algorithm.
  • SEGAN [4]: A groundbreaking U-Net model with GAN training in the time domain.
  • Wave-U-Net [13]: A time-domain U-Net enhancement model.
  • Attention-Wave-U-Net [14]: A time-domain U-Net with local self-attention.
  • DEMUCS (causal) [15]: A U-Net model with a unidirectional LSTM.
  • DSEGAN [6]: A multi-generator U-Net model with GAN training in the time domain.
  • SASEGAN [7]: A time-domain U-Net with GAN training combined with self-attention.
  • Sinc-SEGAN [9]: A SEGAN variant with Sinc convolution added to the U-Net-based generator.

4.4. Evaluation Metrics

In the experiments, the quality metrics we employed are as follows: Perceptual Evaluation of Speech Quality (PESQ, wide-band version in ITU-T P.862.2, from −0.5 to 4.5) [36], Mean Opinion Score (MOS) prediction of speech distortion (CSIG, from 1 to 5) [37], MOS prediction of background noise interference (CBAK, from 1 to 5) [37], MOS prediction of overall effect (COVL, from 1 to 5) [37], segmental SNR (SSNR) [38], and short-time objective intelligibility (STOI) [39]. Higher is better for all six metrics.
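For reference, PESQ and STOI can be computed with the third-party `pesq` and `pystoi` Python packages, as sketched below; the packages and file paths are assumptions for illustration, not tools used by the authors.

```python
import soundfile as sf
from pesq import pesq          # wide-band PESQ, ITU-T P.862.2
from pystoi import stoi        # short-time objective intelligibility

clean, fs = sf.read("clean.wav")          # 16 kHz mono reference utterance (placeholder path)
enhanced, _ = sf.read("enhanced.wav")     # time-aligned model output (placeholder path)
print("PESQ (wb):", pesq(fs, clean, enhanced, "wb"))
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```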

5. Experimental Results

5.1. Influence of Measurement Number

In the experimental stage, we first studied the influence of different measurement numbers (m for short) on the enhancement performance of our model. m is set to 1024, 2048, and 4096. The average scores of multiple experiments are shown in Table 1. The loss function used during training is L_alpha (detailed in Section 5.2), and our model is abbreviated as SECS-U-Net. The experimental results show that similar enhancement effects are obtained when different values of m are used to train the model. Although there are differences in the scores, they have little effect on the overall enhancement performance. The reason is that the loss function in this paper combines different training objectives, which act simultaneously during training. Moreover, Ref. [22] showed that when m is large, the reconstruction error tends to stabilize without additional improvement in performance. Therefore, m is set to 4096 in subsequent experiments.

5.2. Ablation Experiment

In the experimental stage, to evaluate the impact of the CS loss on performance, we study the influence of different training objectives on the enhancement results and record the average scores of each indicator over multiple experiments in Table 2. The loss adopted is indicated in parentheses. Both L_all and L_alpha combine the three training objectives and are defined as $L_{all} = L_{MAE} + L_{MSTFT} + L_{CS}$ and $L_{alpha} = \alpha(L_{MAE} + L_{MSTFT}) + (1 - \alpha)L_{CS}$, where $\alpha$ is a weight factor that balances the contribution of each loss and is set to 0.8. Furthermore, "No Att" means that the Shuffle Attention [25] is not used in our model, and "No Res" means that the Res_Blocks we constructed are not used in the encoder and decoder; in that case, only ordinary convolution and transposed convolution are used for down-sampling and up-sampling.
As shown in Table 2, the proposed model achieves good enhancement performance regardless of the type of loss adopted. Comparing SECS-U-Net (L_MAE), SECS-U-Net (L_CS), and SECS-U-Net (L_MSTFT), we find that L_MSTFT greatly improves the PESQ, CSIG, CBAK, and COVL scores, but its SSNR is lower, at 8.70. This indicates that L_MSTFT helps our model learn time-frequency features and retain speech information when training in the time domain, but residual noise may remain in the enhanced speech, which lowers the SSNR. In contrast, L_MAE and L_CS perform better at improving the SSNR.
SECS-U-Net (L_MAE + L_CS), SECS-U-Net (L_MAE + L_MSTFT), and SECS-U-Net (L_CS + L_MSTFT) adopt combinations of two losses as the training objective. When L_MAE + L_CS is used, the SSNR is 9.43, but the other indicators are inferior to the results of L_MAE + L_MSTFT and L_CS + L_MSTFT, which further indicates that L_MSTFT helps improve the denoising effect of the model. The SSNR of SECS-U-Net (L_MAE + L_MSTFT) and SECS-U-Net (L_CS + L_MSTFT) is 9.13 and 9.51, respectively. For the other indicators, SECS-U-Net (L_MAE + L_MSTFT) performs slightly better than SECS-U-Net (L_CS + L_MSTFT). This indicates that when L_MSTFT is combined with L_MAE or L_CS as the training target, the SSNR can be improved without degrading the other speech evaluation criteria. It also shows that L_MAE is better than L_CS at improving speech quality, while L_CS is slightly better than L_MAE at removing residual noise and improving the SSNR. Since all three training objectives help improve the overall performance of our model, a compound objective is used for training; it enables our model to learn time-frequency characteristics, retain speech information, and effectively suppress residual noise. Comparing L_all with L_alpha, the enhancement performance of SECS-U-Net (L_alpha) is slightly better than that of SECS-U-Net (L_all).
The comparison between SECS-U-Net (L_alpha, No Att) and SECS-U-Net (L_alpha) shows that the attention mechanism helps improve the CBAK and SSNR scores, indicating that it effectively suppresses residual noise and improves the denoising ability of SECS-U-Net. In addition, comparing SECS-U-Net (L_alpha, No Res) with SECS-U-Net (L_alpha) shows that when the Res_Blocks are not used in the codec, the enhancement quality degrades and the PESQ drops to 2.54. This shows that the Res_Blocks we constructed are effective in improving enhancement performance.

5.3. Metrics Comparisons of Different Methods

Table 3 shows the speech quality scores of SECS-U-Net and the reference models on the open test set [30]. In Table 3, we reproduced Wiener [3], SEGAN [4], and SASEGAN [7]; the scores of the other models are taken from their papers. The scores of our model are the mean of multiple experiments. The experimental results show that the causal DEMUCS achieves the best results among all the reference models. Compared with it, SECS-U-Net (L_alpha) achieves similar PESQ and STOI scores of 2.91 and 0.947, while the other indicators are better. Compared with the latest Sinc-SEGAN [9], SECS-U-Net (L_alpha) obtains relative improvements of 11.11% and 14.92% on CSIG and COVL, respectively. In addition, compared with the classical Attention-Wave-U-Net [14], the proposed model improves PESQ, CSIG, CBAK, and COVL by 11.07%, 9.97%, 4.18%, and 10.70%, respectively. These results suggest that our model can effectively improve the quality and intelligibility of noisy speech, and its overall enhancement performance is better than that of the reference models.
The metrics in Table 3 are objective evaluation indexes. The signal-to-artifacts ratio (SAR) and signal-to-distortion ratio (SDR) measures in [40] are further selected to evaluate the artifacts and distortion of the enhanced speech; higher is better for both metrics. The average SSNR, SDR, and SAR scores of our model and the reference models are recorded in Table 4. The results show that Attention-Wave-U-Net achieves the best scores; although SECS-U-Net (L_alpha) does not match it on these measures, Table 3 shows that the overall performance of our model is superior. SECS-U-Net (L_alpha) achieves better SAR and SDR scores of 19.92 and 18.91, respectively, than the latest DEMUCS (causal) and SASEGAN. In terms of SSNR, SECS-U-Net (L_alpha) obtains relative improvements of 11.84% over DEMUCS (causal) and 13.57% over SASEGAN. Table 4 shows that our model introduces less speech distortion.
In Figure 4, the enhancement results of the different models are further illustrated with spectrograms. The results show that all models can effectively remove noise and achieve a good enhancement effect. By comparison, however, the proposed model leaves less residual noise and retains more speech details, which further indicates that our model introduces less distortion and achieves better enhancement performance.

5.4. Parameter Comparisons of Different Methods

The number of parameters of our proposed model is compared with that of the other reference models. Figure 5 presents a histogram of the model parameters. Our model has 30.02 million parameters. Although the classic Attention-Wave-U-Net [14] and the causal DEMUCS [15] have fewer parameters than SECS-U-Net, the metric scores in Table 3 and Table 4 show that their overall enhancement performance is inferior. Furthermore, the parameters of our model account for only 29.99% of those of the recently proposed SASEGAN [7] and 32.99% of those of the latest Sinc-SEGAN [9]. Overall, the results in Figure 5 show that our proposed model achieves better denoising performance with fewer parameters.

5.5. Generalization Abilities Comparisons of Different Methods

To evaluate the generalization ability of our model, Table 5 records the average evaluation scores of Attention-Wave-U-Net [14], SASEGAN [7], and SECS-U-Net (L_alpha) when removing unseen noise. The three models are all trained on the Valentini training set [30] and tested on the 560-Testset, whose noise types differ from those in DEMAND [32]. According to the scores in Table 5, all three models achieve good denoising performance on unseen noise. Attention-Wave-U-Net has a better SSNR score, but its other indicators are inferior to those of SECS-U-Net (L_alpha), indicating that our model achieves good overall performance when dealing with unseen noise.
Figure 6 shows the spectrograms of speech enhanced by the above three models when the noise is outside the training set and the SNR is −5 dB. The results show that all of them achieve good denoising. However, the proposed model loses less speech information and retains more speech details.

6. Conclusions

This paper proposed a time-domain U-Net structure for speech enhancement that applies a CS loss and incorporates the Shuffle Attention mechanism. The attention focuses on the detailed speech features of the final encoder output and suppresses irrelevant information. Following deep learning-based CS, the CS loss optimizes the error between the measurements of the clean speech and those of the enhanced speech, which improves the enhancement performance of the model. Furthermore, we constructed time-domain dilated residual blocks for down-sampling and up-sampling, which capture speech contextual information and detailed features. In the experimental part, the influence of different loss functions on the enhancement results was compared through ablation experiments. The results show that the CS loss can be used as a training objective and achieves better denoising performance, and that the time-frequency loss greatly improves the enhancement performance. Compared with reference models trained on the same dataset, the proposed model obtains higher speech quality scores with fewer parameters. When dealing with noise outside the open dataset, the proposed model also achieves better overall denoising performance. All the experimental analyses show that our model not only achieves a good enhancement effect but also has a certain generalization ability. In future work, we will further study applying the CS loss to time-frequency domain speech enhancement models, and investigate how to further improve the codec in U-Net to raise the speech enhancement performance.

Author Contributions

Conceptualization, Z.K. and Z.H.; methodology, Z.K.; software, Z.K. and C.L.; validation, Z.K. and C.L.; formal analysis, Z.K.; writing—original draft preparation, Z.K.; writing—review and editing, Z.H.; supervision, Z.H.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2018YFC0823402, and the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China, grant number 2017D01C044.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The public dataset [30] used to evaluate our model is permanently available at: http://dx.doi.org/10.7488/ds/1356, accessed on 5 June 2021.

Acknowledgments

The authors would like to thank the authors of [30] for the public dataset, as well as the editors and reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LSTM: long short-term memory
CS loss: compressed sensing loss
GAN: Generative Adversarial Network
SEGAN: speech enhancement generative adversarial network
CS: compressed sensing
RIP: Restricted Isometry Property
1D-Conv: one-dimensional convolution
MSE: mean squared error
MAE: mean absolute error
STFT: short-time Fourier transform
SNR: signal-to-noise ratio
PESQ: Perceptual Evaluation of Speech Quality
MOS: Mean Opinion Score
CSIG: MOS prediction of speech distortion
CBAK: MOS prediction of background noise interference
COVL: MOS prediction of overall effect
SSNR: segmental SNR
STOI: short-time objective intelligibility

References

1. Loizou, P. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2013.
2. Yang, L.-P.; Fu, Q.-J. Spectral Subtraction-Based Speech Enhancement for Cochlear Implant Patients in Background Noise. J. Acoust. Soc. Am. 2005, 117, 1001–1004.
3. Scalart, P.; Filho, J.V. Speech Enhancement Based on a Priori Signal to Noise Estimation. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, USA, 9 May 1996; Volume 2, pp. 629–632.
4. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv 2017, arXiv:1703.09452.
5. Pascual, S.; Serrà, J.; Bonafonte, A. Time-Domain Speech Enhancement Using Generative Adversarial Networks. Speech Commun. 2019, 114, 10–21.
6. Phan, H.; McLoughlin, I.V.; Pham, L.; Chen, O.Y.; Koch, P.; De Vos, M.; Mertins, A. Improving GANs for Speech Enhancement. IEEE Signal Process. Lett. 2020, 27, 1700–1704.
7. Phan, H.; Le Nguyen, H.; Chén, O.Y.; Koch, P.; Duong, N.Q.; McLoughlin, I.; Mertins, A. Self-Attention Generative Adversarial Network for Speech Enhancement. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7103–7107.
8. Yang, F.; Li, J.; Yan, Y. A New Method for Improving Generative Adversarial Networks in Speech Enhancement. In Proceedings of the 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), Hong Kong, China, 24 January 2021; pp. 1–5.
9. Li, L.; Kürzinger, L.; Watzel, T.; Rigoll, G. Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions. Appl. Sci. 2021, 11, 7564.
10. Geng, C.; Wang, L. End-to-End Speech Enhancement Based on Discrete Cosine Transform. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 379–383.
11. Deng, F.; Jiang, T.; Wang, X.-R.; Zhang, C.; Li, Y. NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement. In Proceedings of the Interspeech 2020, Shanghai, China, 25 October 2020; pp. 2457–2461.
12. Stoller, D.; Ewert, S.; Dixon, S. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. arXiv 2018, arXiv:1806.03185.
13. Macartney, C.; Weyde, T. Improved Speech Enhancement with the Wave-U-Net. arXiv 2018, arXiv:1811.11307.
14. Giri, R.; Isik, U.; Krishnaswamy, A. Attention Wave-U-Net for Speech Enhancement. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 249–253.
15. Défossez, A.; Synnaeve, G.; Adi, Y. Real Time Speech Enhancement in the Waveform Domain. In Proceedings of the Interspeech 2020, Shanghai, China, 25 October 2020; pp. 3291–3295.
16. Donoho, D.L. Compressed Sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
17. Haneche, H.; Boudraa, B.; Ouahabi, A. Speech Enhancement Using Compressed Sensing-Based Method. In Proceedings of the 2018 International Conference on Electrical Sciences and Technologies in Maghreb (CISTEM), Algiers, Algeria, 28–31 October 2018; pp. 1–6.
18. Sridhar, K.V.; Kishore Kumar, T. Performance Evaluation of CS Based Speech Enhancement Using Adaptive and Sparse Dictionaries. In Proceedings of the 2019 4th International Conference and Workshops on Recent Advances and Innovations in Engineering (ICRAIE), Kedah, Malaysia, 27–29 November 2019; pp. 1–7.
19. Haneche, H.; Boudraa, B.; Ouahabi, A. A New Way to Enhance Speech Signal Based on Compressed Sensing. Measurement 2020, 151, 107117.
20. Wang, J.-C.; Lee, Y.-S.; Lin, C.-H.; Wang, S.-F.; Shih, C.-H.; Wu, C.-H. Compressive Sensing-Based Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 2122–2131.
21. Kabkab, M.; Samangouei, P.; Chellappa, R. Task-Aware Compressed Sensing with Generative Adversarial Networks. arXiv 2018, arXiv:1802.01284.
22. Bora, A.; Jalal, A.; Price, E.; Dimakis, A.G. Compressed Sensing Using Generative Models. arXiv 2017, arXiv:1703.03208.
23. Wu, Y.; Rosca, M.; Lillicrap, T. Deep Compressed Sensing. arXiv 2019, arXiv:1905.06723.
24. Xu, S.; Zeng, S.; Romberg, J. Fast Compressive Sensing Recovery Using Generative Models with Structured Latent Variables. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2967–2971.
25. Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6 June 2021; pp. 2235–2239.
26. Tropp, J.A.; Gilbert, A.C. Signal Recovery from Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. Inform. Theory 2007, 53, 4655–4666.
27. Donoho, D.L.; Tsaig, Y.; Drori, I.; Starck, J.-L. Sparse Solution of Underdetermined Systems of Linear Equations by Stagewise Orthogonal Matching Pursuit. IEEE Trans. Inform. Theory 2012, 58, 1094–1121.
28. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941.
29. Yamamoto, R.; Song, E.; Kim, J.-M. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6199–6203.
30. Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-Based Speech Enhancement Methods for Noise-Robust Text-to-Speech. In Proceedings of the 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152.
31. Veaux, C.; Yamagishi, J.; King, S. The Voice Bank Corpus: Design, Collection and Data Analysis of a Large Regional Accent Speech Database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; pp. 1–4.
32. Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings. In Proceedings of Meetings on Acoustics ICA2013; Acoustical Society of America: Montreal, QC, Canada, 2013; Volume 19, p. 35081.
33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
34. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019, 2019, 2613–2617.
35. Smith, J.; Gossett, P. A Flexible Sampling-Rate Conversion Method. In Proceedings of the ICASSP '84 IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, CA, USA, 19–21 March 1984; Volume 9, pp. 112–115.
36. ITU-T. P.862.2: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2005.
37. Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238.
38. Hansen, J. An Effective Quality Evaluation Protocol for Speech Enhancement Algorithms. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, 30 November–4 December 1998.
39. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
40. Vincent, E.; Gribonval, R.; Fevotte, C. Performance Measurement in Blind Audio Source Separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469.
Figure 1. The architecture of U-Net.
Figure 2. Proposed model architecture, with 5 encoding layers, 5 decoding layers, and 1 Shuffle Attention module. IN represents the number of channels of the input speech feature map in each layer; Ci and Hi represent the number of channels of the speech feature map after down-sampling and up-sampling at the i-th layer, respectively.
Figure 3. Shuffle Attention mechanism, with Channel Attention (C-Att) and Spatial Attention (S-Att). GAP represents global average pooling, and FC represents a fully connected operation (i.e., $\omega x + b$). Channel shuffle is similar to a reshape operation.
Figure 4. The spectrograms of speech enhanced by (c) Attention-Wave-U-Net [14], (d) DEMUCS (causal) [15], (e) SASEGAN [7], and (f) SECS-U-Net (L_alpha) on the open test set. (a) and (b) are the spectrograms of the clean and noisy speech, respectively.
Figure 5. Comparison of parameters of SECS-U-Net and other reference methods.
Figure 6. The spectrograms of speech enhanced by (c) Attention-Wave-U-Net [14], (d) SASEGAN [7], and (e) SECS-U-Net (L_alpha) on the 560-Testset. (a) and (b) are the spectrograms of the clean and noisy speech, respectively.
Table 1. Influence of different measurement numbers.

Method | PESQ | CSIG | CBAK | COVL | SSNR | STOI
Noisy | 1.97 | 3.34 | 2.44 | 2.63 | 1.68 | 0.916
SECS-U-Net (L_alpha, 1024) | 2.92 | 4.30 | 3.50 | 3.63 | 9.62 | 0.948
SECS-U-Net (L_alpha, 2048) | 2.90 | 4.28 | 3.48 | 3.61 | 9.49 | 0.947
SECS-U-Net (L_alpha, 4096) | 2.91 | 4.30 | 3.49 | 3.62 | 9.54 | 0.947
Table 2. Ablation experiment of different losses.

Method | PESQ | CSIG | CBAK | COVL | SSNR | STOI
Noisy | 1.97 | 3.34 | 2.44 | 2.63 | 1.68 | 0.916
SECS-U-Net (L_MAE) | 2.69 | 3.96 | 3.35 | 3.32 | 9.59 | 0.942
SECS-U-Net (L_CS) | 2.63 | 3.84 | 3.32 | 3.24 | 9.68 | 0.942
SECS-U-Net (L_MSTFT) | 2.91 | 4.33 | 3.44 | 3.64 | 8.70 | 0.947
SECS-U-Net (L_MAE + L_CS) | 2.65 | 3.92 | 3.32 | 3.28 | 9.43 | 0.942
SECS-U-Net (L_MAE + L_MSTFT) | 2.90 | 4.31 | 3.47 | 3.63 | 9.13 | 0.947
SECS-U-Net (L_CS + L_MSTFT) | 2.84 | 4.22 | 3.45 | 3.54 | 9.51 | 0.947
SECS-U-Net (L_all) | 2.87 | 4.25 | 3.47 | 3.58 | 9.58 | 0.947
SECS-U-Net (L_alpha) | 2.91 | 4.30 | 3.49 | 3.62 | 9.54 | 0.947
SECS-U-Net (L_alpha, No Att) | 2.92 | 4.30 | 3.44 | 3.63 | 8.80 | 0.946
SECS-U-Net (L_alpha, No Res) | 2.54 | 3.96 | 3.28 | 3.25 | 9.38 | 0.938
Table 3. Objective evaluation results of the proposed method and the reference methods (the bold indicates our metric scores and the metric scores that exceed ours in the reference models).

Method | PESQ | CSIG | CBAK | COVL | STOI
Noisy | 1.97 | 3.34 | 2.44 | 2.63 | 0.916
Wiener [3] | 2.22 | 3.23 | 2.68 | 2.67 | 0.914
SEGAN [4] | 2.24 | 3.47 | 2.93 | 2.84 | 0.931
Wave-U-Net [13] | 2.40 | 3.52 | 3.24 | 2.96 | /
Attention-Wave-U-Net [14] | 2.62 | 3.91 | 3.35 | 3.27 | /
DEMUCS (causal) [15] | 2.93 | 4.22 | 3.25 | 3.52 | 0.950
DSEGAN [6] | 2.35 | 3.55 | 3.10 | 2.93 | 0.933
SASEGAN [7] | 2.41 | 3.62 | 3.06 | 2.99 | 0.935
Sinc-SEGAN [9] | 2.86 | 3.87 | 3.66 | 3.15 | 0.950
SECS-U-Net (L_alpha) | 2.91 | 4.30 | 3.49 | 3.62 | 0.947
Table 4. SSNR, SAR and SDR of the proposed method and the reference methods (the bold indicates our metric scores and the metric scores that exceed ours in the reference models).

Method | SSNR | SAR | SDR
Noisy | 1.68 | 8.90 | 9.07
SEGAN [4] | 7.15 | 17.19 | 16.05
Attention-Wave-U-Net [14] | 10.05 | 20.64 | 19.24
DEMUCS (causal) [15] | 8.53 | 18.34 | 17.25
SASEGAN [7] | 8.40 | 19.85 | 17.91
SECS-U-Net (L_alpha) | 9.54 | 19.92 | 18.91
Table 5. Objective evaluation results of different models under unseen noise in the open dataset (the bold indicates our metric scores and the metric scores that exceed ours in the reference models).

Method | SNR | PESQ | CSIG | CBAK | COVL | STOI | SSNR
Noisy | −5 dB | 1.10 | 2.01 | 1.42 | 1.48 | 0.805 | −5.91
Noisy | 0 dB | 1.22 | 2.36 | 1.68 | 1.73 | 0.852 | −3.64
Noisy | 5 dB | 1.44 | 2.79 | 2.02 | 2.08 | 0.903 | −0.87
Noisy | 10 dB | 1.82 | 3.30 | 2.46 | 2.54 | 0.939 | 2.45
Noisy | Average | 1.40 | 2.62 | 1.90 | 1.96 | 0.875 | −1.99
Attention-Wave-U-Net [14] | −5 dB | 1.47 | 2.69 | 2.47 | 2.05 | 0.861 | 6.02
Attention-Wave-U-Net [14] | 0 dB | 1.76 | 3.07 | 2.71 | 2.39 | 0.885 | 7.30
Attention-Wave-U-Net [14] | 5 dB | 2.03 | 3.39 | 2.95 | 2.69 | 0.917 | 8.73
Attention-Wave-U-Net [14] | 10 dB | 2.37 | 3.78 | 3.23 | 3.07 | 0.943 | 10.09
Attention-Wave-U-Net [14] | Average | 1.91 | 3.23 | 2.84 | 2.55 | 0.902 | 8.04
SASEGAN [7] | −5 dB | 1.32 | 2.36 | 2.23 | 1.79 | 0.829 | 4.00
SASEGAN [7] | 0 dB | 1.52 | 2.60 | 2.46 | 2.02 | 0.868 | 5.77
SASEGAN [7] | 5 dB | 1.85 | 3.00 | 2.77 | 2.40 | 0.910 | 7.65
SASEGAN [7] | 10 dB | 2.23 | 3.40 | 3.07 | 2.80 | 0.940 | 9.07
SASEGAN [7] | Average | 1.73 | 2.84 | 2.63 | 2.25 | 0.887 | 6.62
SECS-U-Net (L_alpha) | −5 dB | 1.61 | 2.88 | 2.49 | 2.23 | 0.862 | 4.61
SECS-U-Net (L_alpha) | 0 dB | 1.92 | 3.17 | 2.76 | 2.54 | 0.885 | 6.22
SECS-U-Net (L_alpha) | 5 dB | 2.22 | 3.54 | 3.03 | 2.88 | 0.923 | 8.01
SECS-U-Net (L_alpha) | 10 dB | 2.61 | 3.99 | 3.36 | 3.31 | 0.950 | 9.88
SECS-U-Net (L_alpha) | Average | 2.09 | 3.40 | 2.91 | 2.74 | 0.905 | 7.18
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
