**1. Introduction**

Speech enhancement is essential for various speech applications such as robust speech recognition, hearing aids, and mobile communications [1–4]. The main objective of speech enhancement is to improve the quality and intelligibility of noisy speech by suppressing background noise and interference.

In early studies on speech enhancement, the minimum mean-square error (MMSE)-based spectral amplitude estimator algorithms [5,6] were popular, producing enhanced signals with low residual noise. However, the MMSE-based methods have been reported to be ineffective in non-stationary noise environments due to their stationarity assumption on speech and noise. An effective way to deal with non-stationary noise is to utilize a priori information extracted from a speech or noise database (DB), an approach called template-based speech enhancement. One of the most well-known template-based schemes is the non-negative matrix factorization (NMF)-based speech enhancement technique [7,8]. NMF is a latent factor analysis technique that discovers the underlying part-based non-negative representations of the given data. Since it makes no strict assumption on the speech and noise distributions, the NMF-based speech enhancement technique is robust to non-stationary noise environments. However, since the NMF-based algorithm assumes that all data can be described as a linear combination of a finite set of bases, it is known to suffer from speech distortion whenever the signal is not covered by this representational form.

In the past few years, deep neural network (DNN)-based speech enhancement has received tremendous attention due to its ability to model complex mappings [9–12]. These methods map the noisy spectrogram to the clean spectrogram via the neural networks such as the convolutional neural network (CNN) [11] or recurrent neural network

**Citation:** Kim, H.Y.; Yoon, J.W.; Cheon, S.J.; Kang, W.H.; Kim, N.S. A Multi-Resolution Approach to GAN-Based Speech Enhancement. *Appl. Sci.* **2021**, *11*, 721. https://doi.org/10.3390/app 11020721

Received: 2 December 2020 Accepted: 10 January 2021 Published: 13 January 2021



(RNN) [12]. Although the DNN-based speech enhancement techniques have shown promising performance, most of them focus on modifying the magnitude spectra. This can cause a phase mismatch between the clean and enhanced speech since the DNN-based speech enhancement methods usually reuse the noisy phase for waveform reconstruction. For this reason, there has been growing interest in phase-aware speech enhancement [13–15], which exploits the phase information during training and reconstruction. To circumvent the difficulty of phase estimation, end-to-end (E2E) speech enhancement techniques, which directly enhance the noisy speech waveform in the time domain, have been developed [16–18]. Since the E2E speech enhancement techniques operate in a waveform-to-waveform manner without any consideration of the spectra, their performance does not depend on the accuracy of phase estimation.

The E2E approaches, however, rely on distance-based loss functions between the time-domain waveforms. Since these distance-based costs do not take human perception into account, the E2E approaches are not guaranteed to score well on human-perception-related metrics, e.g., the perceptual evaluation of speech quality (PESQ) [19] and short-time objective intelligibility (STOI) [20]. Recently, generative adversarial network (GAN) [21]-based speech enhancement techniques have been developed to overcome this limitation of the distance-based costs [22–26]. The adversarial losses of GAN provide an alternative objective function that reflects human auditory properties, pushing the distribution of the enhanced speech close to that of the clean speech. To our knowledge, SEGAN [22] was the first attempt to apply GAN to the speech enhancement task; it used the noisy speech as conditioning information for a conditional GAN (cGAN) [27]. In [26], an approach that replaces the vanilla GAN with an advanced GAN, such as the Wasserstein GAN (WGAN) [28] or the relativistic standard GAN (RSGAN) [29], was proposed based on the SEGAN framework.

Even though the GAN-based speech enhancement techniques have been found successful, two important issues remain: (1) training instability and (2) insufficient consideration of the characteristics of speech. Since GAN aims at finding the Nash equilibrium of a mini-max problem, its training is known to be unstable. A number of efforts have been devoted to stabilizing GAN training in image processing, by modifying the loss function [28] or the generator and discriminator structures [30,31]. In speech processing, however, this problem has not been extensively studied yet. Moreover, since most GAN-based speech enhancement techniques directly employ models designed for image generation, it is necessary to modify them to suit the inherent nature of speech. For instance, the GAN-based speech enhancement techniques [22,24,26] commonly use the U-Net generator, which originated from image processing tasks. Since the U-Net generator consists of multiple CNN layers, it is insufficient for capturing the temporal information of the speech signal. In regression-based speech enhancement, a modified U-Net structure with added RNN layers for capturing the temporal information of speech showed prominent performance [32]. For speech synthesis, [33] proposed an alternative loss function that depends on multiple window lengths and fast Fourier transform (FFT) sizes and generated speech of good quality, thereby also considering the speech characteristics in the frequency domain.

In this paper, we propose novel generator and discriminator structures for GAN-based speech enhancement that reflect the characteristics of speech while ensuring stable training. The conventional generator is trained to find a mapping from the noisy speech to the clean speech by using sequential convolution layers, which is considered ineffective, especially for speech data. In contrast, the proposed generator progressively estimates the wide frequency range of the clean speech via a novel up-sampling layer.

In the early stage of GAN training, it is too easy for the conventional discriminator to differentiate real samples from fake samples of high-dimensional data. This often causes GAN to fail to reach the equilibrium point due to vanishing gradients [30]. To address this issue, we propose a multi-scale discriminator that is composed of multiple sub-discriminators

processing speech samples at different sampling rates. Even in the early stage of training, the sub-discriminators at low sampling rates cannot easily differentiate the real samples from the fake ones, which contributes to stabilizing the training. Empirical results showed that the proposed generator and discriminator were successful in stabilizing GAN training and outperformed the conventional GAN-based speech enhancement techniques. The main contributions of this paper are summarized as follows:


The rest of the paper is organized as follows: In Section 2, we introduce GAN-based speech enhancement. In Section 3, we present the progressive generator and multi-scale discriminator. Section 4 describes the experimental settings and performance measurements. In Section 5, we analyze the experimental results. We draw conclusions in Section 6.

#### **2. GAN-Based Speech Enhancement**

An adversarial network models the complex distribution of the real data via a two-player mini-max game between a generator and a discriminator. Specifically, the generator takes a randomly sampled noise vector *z* as input and produces a fake sample *G*(*z*) to fool the discriminator. The discriminator, on the other hand, is a binary classifier that decides whether an input sample is real or fake. In order to generate a realistic sample, the generator is trained to deceive the discriminator, while the discriminator is trained to distinguish between the real sample and *G*(*z*). In the adversarial training process, the generator and the discriminator are alternately trained to minimize their respective loss functions. The loss functions of the standard GAN are defined as follows:

$$L_G = \mathbb{E}_{z \sim \mathbb{P}_z(z)}[\log(1 - D(G(z)))], \tag{1}$$

$$L_D = -\mathbb{E}_{\mathbf{x} \sim \mathbb{P}_{\text{clean}}(\mathbf{x})}[\log(D(\mathbf{x}))] - \mathbb{E}_{z \sim \mathbb{P}_z(z)}[\log(1 - D(G(z)))] \tag{2}$$

where *z* is a vector randomly sampled from $\mathbb{P}_z(z)$, which is usually a normal distribution, and $\mathbb{P}_{\text{clean}}(\mathbf{x})$ is the distribution of the clean speech in the training dataset.
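As a numerical illustration (ours, not part of the original formulation), the expectations in Eqs. (1) and (2) can be estimated from batches of discriminator outputs. The sketch below assumes the discriminator already emits probabilities in (0, 1):

```python
import numpy as np

def standard_gan_losses(d_real, d_fake):
    """Monte-Carlo estimates of Eqs. (1) and (2).

    d_real: discriminator outputs D(x) for a batch of clean samples.
    d_fake: discriminator outputs D(G(z)) for a batch of generated samples.
    Both are arrays of probabilities in (0, 1).
    """
    l_g = np.mean(np.log(1.0 - d_fake))                              # Eq. (1)
    l_d = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))   # Eq. (2)
    return l_g, l_d

# At the ideal equilibrium, D(.) = 0.5 for every sample.
l_g, l_d = standard_gan_losses(np.full(8, 0.5), np.full(8, 0.5))
```

At that equilibrium the discriminator loss equals $2\log 2 \approx 1.386$ and the generator loss equals $-\log 2 \approx -0.693$, which is a quick sanity check for an implementation.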

Since GAN was initially proposed for unconditional image generation that has no exact target, it is inadequate to directly apply GAN to speech enhancement which is a regression task to estimate the clean target corresponding to the noisy input. For this reason, GAN-based speech enhancement employs a conditional generation framework [27] where both the generator and discriminator are conditioned on the noisy waveform *c* that has the clean waveform *x* as the target. By concatenating the noisy waveform *c* with the randomly sampled vector *z*, the generator *G* can produce a sample that is closer to the clean waveform *x*. The training process of the cGAN-based speech enhancement is shown in Figure 1a, and the loss functions of the cGAN-based speech enhancement are

$$L_G = \mathbb{E}_{z \sim \mathbb{P}_z(z), \mathbf{c} \sim \mathbb{P}_{\text{noisy}}(\mathbf{c})}[\log(1 - D(G(z, \mathbf{c}), \mathbf{c}))], \tag{3}$$

$$L_D = -\mathbb{E}_{\mathbf{x} \sim \mathbb{P}_{\text{clean}}(\mathbf{x}), \mathbf{c} \sim \mathbb{P}_{\text{noisy}}(\mathbf{c})}[\log D(\mathbf{x}, \mathbf{c})] - \mathbb{E}_{z \sim \mathbb{P}_z(z), \mathbf{c} \sim \mathbb{P}_{\text{noisy}}(\mathbf{c})}[\log(1 - D(G(z, \mathbf{c}), \mathbf{c}))] \tag{4}$$

where $\mathbb{P}_{\text{clean}}(\mathbf{x})$ and $\mathbb{P}_{\text{noisy}}(\mathbf{c})$ are respectively the distributions of the clean and noisy speech in the training dataset.

(**a**) cGAN-based speech enhancement

(**b**) RSGAN-based speech enhancement

**Figure 1.** Illustration of the conventional GAN-based speech enhancement methods. In the training of cGAN-based speech enhancement, the updates of the generator and discriminator alternate over several epochs. During the update of the discriminator, the target of the discriminator is 1 for the clean speech and 0 for the enhanced speech. For the update of the generator, the target of the discriminator is 1 with the discriminator parameters frozen. In contrast, RSGAN-based speech enhancement trains the discriminator to measure a relativism score of the real sample $D_{real}$ and the generator to increase that of the fake sample $D_{fake}$ with fixed discriminator parameters.

In the conventional training of the cGAN, both the probability that a sample is from the real data, *D*(*x*, *c*), and that from the generated data, *D*(*G*(*z*, *c*), *c*), should reach the ideal equilibrium point of 0.5. In practice, however, both tend to approach 1 because the generator cannot influence the probability of the real sample *D*(*x*, *c*). To alleviate this problem, RSGAN [29] proposed a discriminator that estimates the probability that the real sample is more realistic than the generated one. This discriminator makes the probability of the generated sample *D*(*G*(*z*, *c*), *c*) increase as that of the real sample *D*(*x*, *c*) decreases, so that both probabilities can stably reach the Nash equilibrium. In [26], experimental results showed that, compared to other conventional GAN-based speech enhancement methods, the RSGAN-based technique improved the stability of training and enhanced the speech quality. The training process of RSGAN-based speech enhancement is given in Figure 1b, and its loss functions can be written as:

$$L_G = -\mathbb{E}_{(\mathbf{x}_r, \mathbf{x}_f) \sim (\mathbb{P}_r, \mathbb{P}_f)}[\log(\sigma(C(\mathbf{x}_f) - C(\mathbf{x}_r)))], \tag{5}$$

$$L_D = -\mathbb{E}_{(\mathbf{x}_r, \mathbf{x}_f) \sim (\mathbb{P}_r, \mathbb{P}_f)}[\log(\sigma(C(\mathbf{x}_r) - C(\mathbf{x}_f)))] \tag{6}$$

where the real and fake data-pairs are defined as $\mathbf{x}_r \triangleq (\mathbf{x}, \mathbf{c}) \sim \mathbb{P}_r$ and $\mathbf{x}_f \triangleq (G(z, \mathbf{c}), \mathbf{c}) \sim \mathbb{P}_f$, and $C(\mathbf{x})$ is the output of the last layer of the discriminator before the sigmoid activation function $\sigma(\cdot)$, i.e., $D(\mathbf{x}) = \sigma(C(\mathbf{x}))$.
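For concreteness, the relativistic losses in Eqs. (5) and (6) can be estimated from batches of pre-sigmoid critic outputs. The following sketch (ours; the variable names are illustrative) mirrors the equations directly:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def rsgan_losses(c_real, c_fake):
    """Monte-Carlo estimates of Eqs. (5) and (6).

    c_real: pre-sigmoid critic outputs C(x_r) for real pairs (x, c).
    c_fake: pre-sigmoid critic outputs C(x_f) for fake pairs (G(z, c), c).
    """
    l_g = -np.mean(np.log(sigmoid(c_fake - c_real)))  # Eq. (5)
    l_d = -np.mean(np.log(sigmoid(c_real - c_fake)))  # Eq. (6)
    return l_g, l_d
```

Note that when the critic scores real and fake samples identically, both losses equal $\log 2$, reflecting the relativistic equilibrium described above rather than the saturating behavior of the standard GAN.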

Two penalties are commonly used to stabilize GAN training: a gradient penalty for the discriminator [28] and an $L_1$ loss penalty for the generator [24]. First, the gradient penalty regularization for the discriminator is used to prevent exploding or vanishing gradients. It penalizes the model when the $L_2$ norm of the discriminator gradient moves away from 1, so as to satisfy the Lipschitz constraint. The modified discriminator loss function with the gradient penalty is as follows:

$$L_{GP}(D) = \mathbb{E}_{\tilde{\mathbf{x}}, \mathbf{c} \sim \tilde{\mathbb{P}}}\Big[\big(\|\nabla_{\tilde{\mathbf{x}}, \mathbf{c}}\, C(\tilde{\mathbf{x}}, \mathbf{c})\|_2 - 1\big)^2\Big], \tag{7}$$

$$L_{D\text{-}GP}(D) = -\mathbb{E}_{(\mathbf{x}_r, \mathbf{x}_f) \sim (\mathbb{P}_r, \mathbb{P}_f)}[\log(\sigma(C(\mathbf{x}_r) - C(\mathbf{x}_f)))] + \lambda_{GP} L_{GP}(D) \tag{8}$$

where $\tilde{\mathbb{P}}$ is the joint distribution of $\mathbf{c}$ and $\tilde{\mathbf{x}} = \epsilon\mathbf{x} + (1 - \epsilon)\hat{\mathbf{x}}$, $\epsilon$ is sampled from a uniform distribution on $[0, 1]$, and $\hat{\mathbf{x}}$ is the sample from *G*(*z*, *c*). $\lambda_{GP}$ is the hyper-parameter that balances the gradient penalty loss and the adversarial loss of the discriminator.
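To make the interpolation step concrete, the sketch below (ours) evaluates Eq. (7) for a hypothetical linear critic, for which the input gradient is known analytically; a real implementation would obtain the gradient via automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty_toy(w, x, x_hat):
    """Sketch of Eq. (7) for a toy linear critic C(v) = w . v,
    whose gradient with respect to its input is w everywhere.
    x is a clean sample, x_hat a generated sample."""
    eps = rng.uniform(0.0, 1.0)              # epsilon ~ U[0, 1]
    x_tilde = eps * x + (1.0 - eps) * x_hat  # random interpolate
    grad_norm = np.linalg.norm(w)            # ||grad C(x_tilde)||_2, analytic here
    return (grad_norm - 1.0) ** 2

x = np.array([1.0, 0.0])
x_hat = np.array([0.0, 1.0])
w_unit = np.array([0.6, 0.8])   # ||w||_2 = 1, so the penalty vanishes
penalty = gradient_penalty_toy(w_unit, x, x_hat)
```

The toy critic makes the 1-Lipschitz target explicit: any critic whose gradient norm stays at 1 along the interpolates incurs zero penalty.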

Second, several prior studies [22–24] found it effective to add a loss term that minimizes the $L_1$ distance between the clean speech *x* and the generated speech *G*(*z*, *c*) during generator training. The modified generator loss with the $L_1$ loss is defined as

$$L_1(G) = \|G(z, \mathbf{c}) - \mathbf{x}\|_1, \tag{9}$$

$$L_{G\text{-}L_1}(G) = -\mathbb{E}_{(\mathbf{x}_r, \mathbf{x}_f) \sim (\mathbb{P}_r, \mathbb{P}_f)}[\log(\sigma(C(\mathbf{x}_f) - C(\mathbf{x}_r)))] + \lambda_{L_1} L_1(G) \tag{10}$$

where $\|\cdot\|_1$ is the $L_1$ norm, and $\lambda_{L_1}$ is a hyper-parameter balancing the $L_1$ loss and the adversarial loss of the generator.

#### **3. Multi-Resolution Approach for Speech Enhancement**


In this section, we propose a novel GAN-based speech enhancement model which consists of a progressive generator and a multi-scale discriminator. The overall architecture of the proposed model is shown in Figure 2, and the details of the progressive generator and the multi-scale discriminator are given in Figure 3.

**Figure 2.** Overall architecture of the proposed GAN-based speech enhancement. The up-sampling block and the multiple discriminators *Dn* are newly added, and the rest of the architecture is the same as that of [26]. The components within the dashed line will be addressed in Figure 3.

**Figure 3.** Illustration of the progressive generator and the multi-scale discriminator. The sub-discriminators calculate the relativism score $D_n(G_n, \mathbf{x}_n) = \sigma(C_n(\mathbf{x}_{r_n}) - C_n(\mathbf{x}_{f_n}))$ at each layer. The figure shows the case of $p, q = 4k$, but it can be extended to any *p* and *q*. In our experiments, *p* and *q* ranged from 1*k* to 16*k*.

#### *3.1. Progressive Generator*

Conventionally, GAN-based speech enhancement systems adopt the U-Net generator [22], which is composed of two components: an encoder *Genc* and a decoder *Gdec*. The encoder *Genc* consists of repeated convolutional layers that produce compressed latent vectors from the noisy speech, and the decoder *Gdec* contains multiple transposed convolutional layers that restore the clean speech from the compressed latent vectors. These transposed convolutional layers in *Gdec* are known to be able to generate low-resolution data from the compressed latent vectors; however, their capability to generate high-resolution data is severely limited [30]. Especially in the case of speech data, it is difficult for the transposed convolutional layers to generate speech at a high sampling rate because they should cover a wide frequency range.

Motivated by the progressive GAN, which starts by generating low-resolution images and then progressively increases the resolution [30,31], we propose a novel generator that can incrementally widen the frequency band of the speech by applying an up-sampling block to the decoder *Gdec*. As shown in Figure 3, the proposed up-sampling block consists of 1D-convolution layers, element-wise addition, and linear interpolation layers. The up-sampling block yields the intermediate enhanced speech $G_n(z, \mathbf{c})$ at each layer through the 1D convolution layer and element-wise addition so that the wide frequency band of the clean speech is progressively estimated. Since the sampling rate is increased through the linear interpolation layer, it is possible to generate the intermediate enhanced speech at a higher layer while maintaining the frequency components estimated at the lower layers. This incremental process is repeated until the target sampling rate is reached, which is 16 kHz in our experiments. Finally, we exploit the down-sampled clean speech $\mathbf{x}_n$, obtained by low-pass filtering and decimation, as the target for each layer to provide multi-resolution loss functions. We define the real and fake data-pairs at different

sampling rates as $\mathbf{x}_{r_n} \triangleq (\mathbf{x}_n, \mathbf{c}_n) \sim \mathbb{P}_{r_n}$ and $\mathbf{x}_{f_n} \triangleq (G_n(z, \mathbf{c}), \mathbf{c}_n) \sim \mathbb{P}_{f_n}$, and the proposed multi-resolution loss functions with the $L_1$ loss are given as follows:

$$\begin{split} L_G(p) &= \sum_{\substack{n \ge p \\ n \in \mathcal{N}_G}} L_{G_n} + \lambda_{L_1} L_1(G_n), \quad \mathcal{N}_G = \{1k, 2k, 4k, 8k, 16k\}, \\ &= \sum_{\substack{n \ge p \\ n \in \mathcal{N}_G}} -\mathbb{E}_{(\mathbf{x}_{r_n}, \mathbf{x}_{f_n}) \sim (\mathbb{P}_{r_n}, \mathbb{P}_{f_n})}[\log(\sigma(C_n(\mathbf{x}_{f_n}) - C_n(\mathbf{x}_{r_n})))] + \lambda_{L_1} \|G_n(z, \mathbf{c}) - \mathbf{x}_n\|_1 \end{split} \tag{11}$$

where $\mathcal{N}_G$ is the set of possible *n* for the proposed generator, and *p* is the sampling rate at which the intermediate enhanced speech is first obtained.
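The two resampling operations underlying the progressive generator, linear-interpolation up-sampling between decoder layers and the construction of the down-sampled targets $\mathbf{x}_n$, can be sketched as follows (our illustration; the moving-average low-pass filter is a simplifying stand-in for a proper anti-aliasing filter):

```python
import numpy as np

def upsample_linear(x, factor=2):
    """Raise the sampling rate by linear interpolation, as in the
    up-sampling block between decoder layers."""
    n = len(x)
    new_t = np.linspace(0.0, n - 1, factor * (n - 1) + 1)
    return np.interp(new_t, np.arange(n), x)

def downsample_target(x, factor=2):
    """Build a down-sampled clean target x_n: a moving-average low-pass
    filter (a simplifying stand-in) followed by decimation."""
    kernel = np.ones(factor) / factor
    return np.convolve(x, kernel, mode="same")[::factor]

up = upsample_linear(np.array([0.0, 2.0, 4.0]))  # -> [0., 1., 2., 3., 4.]
```

Because the interpolation only inserts values between existing samples, the frequency content already estimated at a lower layer is preserved while the higher layer fills in the wider band.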

#### *3.2. Multi-Scale Discriminator*

When generating high-resolution image and speech data in the early stage of training, it is hard for the generator to produce a realistic sample due to its insufficient model capacity. The discriminator can therefore easily differentiate the generated samples from the real samples, which means that the real and fake data distributions do not have substantial overlap. This problem often causes training instability and even mode collapse [30]. To stabilize the training, we propose a multi-scale discriminator that consists of multiple sub-discriminators processing speech samples at different sampling rates.

As presented in Figure 3, the intermediate enhanced speech $G_n(z, \mathbf{c})$ at each layer restores the down-sampled clean speech $\mathbf{x}_n$. Based on this, we can utilize the intermediate enhanced speech and the down-sampled clean speech as the input to each sub-discriminator $D_n$. Since each sub-discriminator can only access limited frequency information depending on its sampling rate, each sub-discriminator solves a different level of discrimination task. For example, discriminating the real from the generated speech is more difficult at a lower sampling rate than at a higher one. The sub-discriminators at lower sampling rates play an important role in stabilizing the early stage of the training. As the training progresses, this role shifts upwards to the sub-discriminators at higher sampling rates. Finally, the proposed multi-scale loss for the discriminator with the gradient penalty is given by

$$\begin{split} L_D(q) &= \sum_{\substack{n \ge q \\ n \in \mathcal{N}_D}} L_{D_n} + \lambda_{GP} L_{GP}(D_n), \quad \mathcal{N}_D = \{1k, 2k, 4k, 8k, 16k\}, \\ &= \sum_{\substack{n \ge q \\ n \in \mathcal{N}_D}} -\mathbb{E}_{(\mathbf{x}_{r_n}, \mathbf{x}_{f_n}) \sim (\mathbb{P}_{r_n}, \mathbb{P}_{f_n})}[\log(\sigma(C_n(\mathbf{x}_{r_n}) - C_n(\mathbf{x}_{f_n})))] + \lambda_{GP} \mathbb{E}_{\tilde{\mathbf{x}}_n, \mathbf{c}_n \sim \tilde{\mathbb{P}}_n}\Big[\big(\|\nabla_{\tilde{\mathbf{x}}_n, \mathbf{c}_n} C_n(\tilde{\mathbf{x}}_n, \mathbf{c}_n)\|_2 - 1\big)^2\Big] \end{split} \tag{12}$$

where $\tilde{\mathbb{P}}_n$ is the joint distribution of the down-sampled noisy speech $\mathbf{c}_n$ and $\tilde{\mathbf{x}}_n = \epsilon\mathbf{x}_n + (1 - \epsilon)\hat{\mathbf{x}}_n$, $\epsilon$ is sampled from a uniform distribution on $[0, 1]$, $\mathbf{x}_n$ is the down-sampled clean speech, and $\hat{\mathbf{x}}_n$ is the sample from $G_n(z, \mathbf{c})$. $\mathcal{N}_D$ is the set of possible *n* for the proposed discriminator, and *q* is the minimum sampling rate at which the intermediate enhanced output is used as the input to a sub-discriminator. The adversarial losses $L_{D_n}$ are equally weighted.
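As an illustration of how Eq. (12) aggregates the sub-discriminators, the sketch below (ours; the dictionary-based interface and precomputed penalty terms are assumptions for brevity) sums the relativistic loss and gradient penalty over all rates $n \ge q$:

```python
import numpy as np

RATES = [1000, 2000, 4000, 8000, 16000]  # N_D, in Hz

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def multi_scale_d_loss(c_real, c_fake, gp, q, lam_gp=10.0):
    """Sketch of Eq. (12): accumulate the relativistic loss and gradient
    penalty of every sub-discriminator whose sampling rate n >= q.

    c_real, c_fake: dicts rate -> pre-sigmoid critic outputs C_n(.)
    gp: dict rate -> precomputed gradient-penalty term L_GP(D_n)
    """
    total = 0.0
    for n in RATES:
        if n < q:
            continue
        adv = -np.mean(np.log(sigmoid(c_real[n] - c_fake[n])))
        total += adv + lam_gp * gp[n]
    return total
```

With *q* = 8 kHz only the 8 kHz and 16 kHz sub-discriminators contribute; lowering *q* brings the easier low-rate tasks into the sum, which is exactly the stabilization mechanism described above.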

#### **4. Experimental Settings**
