#### *4.1. Dataset*

We used the publicly available dataset in [34] to evaluate the performance of the proposed speech enhancement technique. The dataset consists of 30 speakers from the Voice Bank corpus [35]; 28 speakers (14 male and 14 female) were used for the training set (11,572 utterances) and 2 speakers (one male and one female) for the test set (824 utterances). The training set simulated a total of 40 noisy conditions with 10 different noise sources (2 artificial and 8 from the DEMAND database [36]) at signal-to-noise ratios (SNRs) of 0, 5, 10, and 15 dB. The test set was created using 5 noise sources (living room, office, bus, cafeteria, and public square noise from the DEMAND database), which were different from the training noises, added at SNRs of 2.5, 7.5, 12.5, and 17.5 dB. Both the training and test sets were down-sampled from 48 kHz to 16 kHz.
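The released dataset already contains the noisy mixtures, so the following is only a minimal sketch of how a noisy utterance at a target SNR can be created and down-sampled; the file name and the `mix_at_snr` helper are illustrative, not part of the dataset release.

```python
import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio equals snr_db, then mix."""
    noise = noise[: len(clean)]                      # trim noise to the utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Down-sampling from 48 kHz to 16 kHz, as described for both sets.
clean_48k, _ = librosa.load("p232_001.wav", sr=48000)  # hypothetical file name
clean_16k = librosa.resample(clean_48k, orig_sr=48000, target_sr=16000)
```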

#### *4.2. Network Structure*

The configuration of the proposed generator is described in Table 1. We used a U-Net structure with 11 convolutional layers for the encoder *Genc* and the decoder *Gdec*, as in [22,26]. The output shape at each layer is represented by the temporal dimension and the number of feature maps. Conv1D in the encoder denotes a one-dimensional convolutional layer, and TrConv in the decoder denotes a transposed convolutional layer. We used approximately 1 s of speech (16,384 samples) as the input to the encoder. The last output of the encoder was concatenated with a noise vector of shape 8 × 1024 randomly sampled from the standard normal distribution *N*(0, 1). In [27], it was reported that the generator usually learns to ignore the noise prior *z* in the CGAN, and we observed a similar tendency in our experiments. For this reason, we removed the noise from the input, and the shape of the latent vector became 8 × 1024. The architecture of *Gdec* mirrored that of *Genc*, with the same number and width of filters per layer; however, the skip connections from *Genc* doubled the number of feature maps in every layer. The proposed up-sampling block *Gup* consisted of 1D convolution layers, element-wise addition operations, and linear interpolation layers.

**Table 1.** Architecture of the proposed generator. Output shape represented temporal dimension and feature maps.
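Below is a minimal Keras sketch of the generator described above. The per-layer filter counts, kernel width of 31, and activations (PReLU, tanh) are assumptions in the style of SEGAN-like models, since Table 1 is not reproduced here; `up_block` is a separate sketch of *Gup*, which the text describes apart from the TrConv decoder.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed per-layer filter counts (SEGAN-style); Table 1 gives the actual values.
FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]

def up_block(x, filters, kernel=31):
    # Sketch of Gup: up-sampling, a 1D convolution, and an element-wise add.
    # UpSampling1D repeats samples; the linear interpolation described in the
    # text would require a custom layer.
    up = layers.UpSampling1D(size=2)(x)
    y = layers.Conv1D(filters, kernel, padding="same")(up)
    s = layers.Conv1D(filters, 1, padding="same")(up)  # 1x1 conv to match channels
    return layers.Add()([y, s])

def build_generator(input_len=16384, kernel=31):
    x = inp = layers.Input(shape=(input_len, 1))
    skips = []
    # Encoder Genc: 11 strided Conv1D layers, 16,384 samples -> 8 x 1024 latent.
    for f in FILTERS:
        x = layers.Conv1D(f, kernel, strides=2, padding="same")(x)
        x = layers.PReLU(shared_axes=[1])(x)
        skips.append(x)
    # The noise prior z is omitted, as in the text, so the latent is 8 x 1024.
    # Decoder Gdec: mirror of Genc; skip connections double the feature maps.
    for f, skip in zip(FILTERS[-2::-1], skips[-2::-1]):
        x = layers.Conv1DTranspose(f, kernel, strides=2, padding="same")(x)
        x = layers.PReLU(shared_axes=[1])(x)
        x = layers.Concatenate()([x, skip])
    out = layers.Conv1DTranspose(1, kernel, strides=2, padding="same",
                                 activation="tanh")(x)
    return Model(inp, out, name="generator")
```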


In this experiment, the proposed discriminator had the same serial convolutional layers as *Genc*. The input to the discriminator had two channels of 16,384 samples each: the clean speech and the enhanced speech. The remaining temporal dimensions and feature maps were the same as those of *Genc*. In addition, we used the LeakyReLU activation function without any normalization technique. After the last convolutional layer, there was a 1 × 1 convolution, and its output was fed to a fully-connected layer. To construct the proposed multi-scale discriminator, we used 5 different sub-discriminators, *D16k*, *D8k*, *D4k*, *D2k*, and *D1k*, trained according to Equation (12). Each sub-discriminator had a different input dimension depending on the sampling rate.
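A hedged sketch of one sub-discriminator and the multi-scale set follows. The filter counts mirror the assumed *Genc* values above, and keeping all 11 layers for the shorter inputs is likewise an assumption, since the text does not state whether the smaller sub-discriminators use fewer layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]  # assumed, as for Genc

def build_sub_discriminator(input_len, name, kernel=31):
    # Two input channels: the clean speech and the enhanced speech.
    x = inp = layers.Input(shape=(input_len, 2))
    for f in FILTERS:
        x = layers.Conv1D(f, kernel, strides=2, padding="same")(x)
        x = layers.LeakyReLU()(x)  # no normalization, as stated in the text
    x = layers.Conv1D(1, 1, padding="same")(x)  # 1 x 1 convolution
    x = layers.Flatten()(x)
    out = layers.Dense(1)(x)  # fully-connected output layer
    return Model(inp, out, name=name)

# Five sub-discriminators D16k ... D1k; each input length matches its
# sampling rate (1 s of audio at 16, 8, 4, 2, and 1 kHz).
subs = {n: build_sub_discriminator(16384 // 2 ** i, f"D_{n}")
        for i, n in enumerate(["16k", "8k", "4k", "2k", "1k"])}
```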

The model was trained using the Adam optimizer [37] for 80 epochs with a learning rate of 0.0002 for both the generator and the discriminator. The batch size was 50, with 1-s audio segments that were sliced using windows of length 16,384 with an overlap of 8192 samples. We also applied a pre-emphasis filter with impulse response [−0.95, 1] to all training samples. For inference, the enhanced signals were reconstructed through overlap-add. The hyper-parameters balancing the penalty terms were set to *λL1* = 200 and *λGP* = 10 such that they matched the dynamic range of magnitude with respect to the generator and discriminator losses. Note that we gave the same weight to the adversarial losses, *LGn* and *LDn*, for all *n* ∈ {1*k*, 2*k*, 4*k*, 8*k*, 16*k*}. We implemented all the networks in Keras with a TensorFlow [38] backend using publicly available code (the SERGAN framework is available at https://github.com/deepakbaby/se_relativisticgan). All training was performed on a single Titan RTX 24 GB GPU, and it took around 2 days.
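The following is a minimal sketch of the pipeline details just described: pre-emphasis, window slicing, overlap-add reconstruction, and the optimizer settings. The helper names are illustrative, and the uniform averaging in the overlapped regions is an assumption about the exact reconstruction scheme.

```python
import numpy as np
import tensorflow as tf

WIN, HOP = 16384, 8192  # 1-s windows at 16 kHz with an 8192-sample overlap

def preemphasis(x, coeff=0.95):
    # y[n] = x[n] - coeff * x[n-1]; the text writes the response as [-0.95, 1].
    return np.append(x[0], x[1:] - coeff * x[:-1])

def slice_windows(x):
    # Slice an utterance into overlapping WIN-sample training segments
    # (any trailing samples shorter than WIN are dropped in this sketch).
    n = 1 + max(0, (len(x) - WIN) // HOP)
    return np.stack([x[i * HOP : i * HOP + WIN] for i in range(n)])

def overlap_add(frames):
    # Reconstruct the enhanced signal from overlapping frames at inference time.
    out = np.zeros(HOP * (len(frames) - 1) + WIN)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * HOP : i * HOP + WIN] += f
        norm[i * HOP : i * HOP + WIN] += 1.0
    return out / np.maximum(norm, 1e-8)

# Optimizer and penalty weights as stated above.
g_opt = tf.keras.optimizers.Adam(learning_rate=2e-4)
d_opt = tf.keras.optimizers.Adam(learning_rate=2e-4)
LAMBDA_L1, LAMBDA_GP = 200.0, 10.0
```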

#### *4.3. Evaluation Methods*

4.3.1. Objective Evaluation

The quality of the enhanced speech was evaluated using the following objective metrics:

